AI Compute: Training vs Inference

Understanding compute requirements, power demand, and data center infrastructure implications
🧠
Training
Teaching the model – building intelligence
Compute Scale 10,000–100,000 GPUs
Duration Weeks to months
Power per Run 50–100 MW
Cost per Run $100M–$1B+
GPU Utilization ~100% sustained
Frequency Periodic (new models)
Think of it as: Building a factory. Enormous upfront investment, runs intensely for a defined period, produces a reusable asset (the model weights).
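The cluster power figures above can be sanity-checked with back-of-envelope arithmetic. The per-GPU draw (~700 W, roughly an H100-class accelerator) and the 1.5× facility overhead (CPUs, networking, cooling) are illustrative assumptions, not figures from this document:

```python
def cluster_power_mw(num_gpus: int, gpu_watts: float = 700.0,
                     overhead: float = 1.5) -> float:
    """Total facility power (MW) for a training cluster.

    gpu_watts: assumed per-accelerator draw (~H100-class).
    overhead: assumed multiplier for CPUs, networking, and cooling.
    """
    return num_gpus * gpu_watts * overhead / 1e6

# The 10,000-100,000 GPU range from the table:
print(f"{cluster_power_mw(10_000):.1f} MW")   # 10.5 MW
print(f"{cluster_power_mw(100_000):.1f} MW")  # 105.0 MW
```

A 100,000-GPU cluster lands at roughly 100 MW under these assumptions, consistent with the 50–100 MW quoted per run.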
⚡
Inference
Using the model – serving predictions
Query Volume Millions/day per model
Duration 24/7/365, always on
Power per Server 1–10 kW (but at scale…)
Fleet Power 200–500 MW+
GPU Utilization Variable, demand-driven
Growth Exponential (user adoption)
Think of it as: Operating the factory 24/7. Low cost per unit, but the volume never stops growing. Every ChatGPT query, every Copilot suggestion, every API call is inference.
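The fleet power figures follow from simple throughput arithmetic. Everything below (per-server throughput, peak provisioning factor, server draw, PUE) is an illustrative assumption, not a figure from this document:

```python
def fleet_power_mw(queries_per_day: float,
                   queries_per_server_sec: float = 1.0,  # heavy generative load
                   peak_to_avg: float = 2.0,             # provision for peaks
                   server_kw: float = 5.0,               # mid 1-10 kW range
                   pue: float = 1.3) -> float:
    """Inference fleet power (MW) needed to serve a daily query volume."""
    avg_qps = queries_per_day / 86_400
    servers = avg_qps * peak_to_avg / queries_per_server_sec
    return servers * server_kw * pue / 1_000

# 2 billion queries/day across a fleet:
print(f"{fleet_power_mw(2e9):.0f} MW")  # ~300 MW with these assumptions
```

Because the MW figure scales linearly with query volume, fleet power compounds with adoption even as per-query efficiency improves.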

⚡ Power Demand at Scale

Single Training Cluster 50–100 MW
Large Inference Fleet 200–500 MW
Hyperscaler Campus (Training + Inference) 500 MW–2 GW
US Total AI Data Center Power (2027E) 35–45 GW
📊
The shift is underway. Training dominated early AI power demand (2022–2024). But as models deploy to billions of users, inference is rapidly becoming the majority of compute. By 2027, inference is expected to consume 60–70% of total AI compute power. This means sustained, growing baseload demand, not periodic spikes.

2024 Compute Split

Training 60%
Inference 40%

2027E Compute Split

Training 30%
Inference 70%
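Applying the 2027E split to the 35–45 GW US total from the table above makes the scale concrete:

```python
total_gw = (35, 45)        # 2027E US AI data center power range (table above)
inference_share = 0.70     # 2027E compute split

for total in total_gw:
    inf = total * inference_share
    train = total * (1 - inference_share)
    print(f"{total} GW total -> {inf:.1f} GW inference, {train:.1f} GW training")
```

Even at the low end, inference alone would draw roughly 25 GW of sustained baseload.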

πŸ—οΈ Data Center Infrastructure Implications

🔌

Power

Time-to-power is the #1 bottleneck. Grid interconnection takes 3–7 years. The Three Mile Island (TMI) restart has been delayed to 2031 by PJM. Nuclear and gas backup required for baseload.

3–7 yr lead times
❄️

Cooling

Training clusters generate sustained high-density heat (50+ kW/rack). Inference is lower per-server but grows continuously. Both drive demand for liquid cooling, direct expansion, and precision air handling.

50+ kW/rack density
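The rack arithmetic behind the liquid-cooling shift, using the density quoted above and a large training cluster from the table. The air-cooling ceiling (~15 kW/rack) is an illustrative assumption, not a figure from this document:

```python
cluster_mw = 100     # large training cluster (table above)
rack_kw = 50         # sustained density quoted above
air_ceiling_kw = 15  # assumed practical limit for air cooling

racks = cluster_mw * 1_000 / rack_kw
print(f"{racks:.0f} racks at {rack_kw} kW each "
      f"({rack_kw / air_ceiling_kw:.1f}x the assumed air-cooling ceiling)")
```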
📈

Capex Scale

Hyperscalers committed $300B+ in 2025–2026 capex. Meta: 6.6 GW nuclear (Oklo + Vistra). Microsoft: TMI restart. Google: Kairos SMR. Amazon: Talen Energy nuclear deal.

$300B+ committed
⏳

Bottleneck

Transformers, switchgear, and electrical equipment have 2–4 year lead times. Grid modernization is the rate limiter. Demand far exceeds manufacturing capacity for critical power infrastructure.

2–4 yr equipment wait

Key Takeaway

Training gets the headlines, but inference is the infrastructure story.
It runs 24/7, scales with every new user, and its power demand compounds.
The companies that solve power delivery and cooling at scale will define the next decade.