Research paper · 001 · Open
Series: Proof-Efficient AI · Volume: 01 · Filed: May 2026 · Status: Living document

Energy efficiency is not a datacenter property. It is a workload property.

Aggregate metrics — PUE, kWh per rack, MW per facility — describe a building. They describe almost nothing about the computation that ran inside it. This paper argues that the only honest unit of AI energy efficiency is kWh per useful workload outcome, captured at the hardware, cryptographically signed at the source, and verifiable forever.

The AI industry currently reports energy in facility-level aggregates: total megawatt-hours consumed, power usage effectiveness (PUE), water usage effectiveness (WUE), carbon emissions averaged over annual production. These figures are accounting artefacts. They cannot answer the question that actually matters to operators, regulators, procurement teams, and capital allocators: how much energy did this specific model, training run, or inference workload actually consume — and was that consumption efficient?

We argue that the unit of measurement must shift from the building to the workload; from self-reported to hardware-attested; from retrospective audit to signed at capture. We outline the measurement stack required to make this shift, the proof artefacts that result, and the operational feedback loop this creates — where workload telemetry no longer just monitors infrastructure, but informs how infrastructure should be built in the first place.

The frame problem

When a hyperscaler announces that its datacenter operates at a PUE of 1.10, what has actually been said? That for every unit of energy delivered to IT equipment, a further 0.10 units were spent on cooling, lighting, power conversion, and other overhead. It is a real number, and a useful one for facility engineering. It tells the world almost nothing about whether the joules delivered to compute were well spent.

A PUE-1.10 facility running a poorly implemented attention kernel at 40% GPU utilisation is, in the only sense that should matter, less efficient than a PUE-1.4 facility running a well-implemented version of the same workload at 92% utilisation. The first looks excellent on a sustainability report. The second produced more useful computation per joule consumed. The reporting frame chose the wrong subject. The building is not the workload.
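The comparison can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions that are not in the text: the optimised facility is taken to have the PUE of 1.10 mentioned earlier, useful output is assumed to scale linearly with GPU utilisation, and IT power is assumed to follow a linear model with an idle floor (`idle_fraction`, a hypothetical parameter set to 0.3 here).

```python
def facility_energy_per_unit_work(pue: float, utilisation: float,
                                  idle_fraction: float = 0.3) -> float:
    """Relative facility energy per unit of useful work (arbitrary units)."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    # Assumed power model: fraction of peak IT power drawn at this utilisation.
    it_power = idle_fraction + (1.0 - idle_fraction) * utilisation
    # Assumed output model: useful work per unit time is proportional
    # to utilisation, so energy per unit of work is power / utilisation.
    return pue * it_power / utilisation


low_pue = facility_energy_per_unit_work(pue=1.10, utilisation=0.40)
high_pue = facility_energy_per_unit_work(pue=1.40, utilisation=0.92)
print(f"PUE 1.10 at 40% utilisation: {low_pue:.3f}")
print(f"PUE 1.40 at 92% utilisation: {high_pue:.3f}")
```

Under these assumptions the low-PUE facility spends roughly 1.60 units of facility energy per unit of work against roughly 1.44 for the high-PUE one. The ordering is sensitive to the idle floor: with a perfectly power-proportional GPU (`idle_fraction=0.0`) it reverses, which is one reason to measure per-workload energy directly rather than model it.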

Worse: PUE is reported as an annualised average, smoothed across all workloads, all seasons, and all operating regimes. Two facilities with identical PUEs may be doing fundamentally different work. Two workloads in the same facility may have wildly divergent energy footprints. The aggregate hides everything that operators, regulators, and customers should be able to see.

"You cannot manage what you do not measure, and you cannot trust what you cannot verify."
(Operational premise, Serial Alice Research)

The unit of measurement

If the building is the wrong subject, what is the right one? We argue the only honest unit is energy per unit of useful work delivered. The definition of "useful work" varies by workload, and that is acceptable — even necessary. For training, the unit may be kWh per epoch at a target loss. For inference, kWh per million tokens or kWh per accepted completion. For batch jobs, kWh per record.

Canonical form
$$\eta_{\text{workload}} = \frac{\text{useful output}}{\text{energy consumed at capture}}$$
Both numerator and denominator must be measured at the workload boundary — not inferred from facility totals, not estimated from manufacturer spec sheets, not back-calculated from billing data. The denominator is read directly from NVML, IPMI, or PDU sensors at runtime; the numerator is defined by the workload contract.
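As a minimal sketch of how the denominator might be derived from runtime sensor readings, the example below integrates timestamped power samples into energy and divides by a token count. The sample format, the function names, and the trace values are assumptions made for illustration. The function reports kWh per million tokens, which is the reciprocal of η in the canonical form.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


def energy_joules(samples: Sequence[Sample]) -> float:
    """Integrate power over time with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by timestamp")
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def kwh_per_million_tokens(samples: Sequence[Sample], tokens: int) -> float:
    """Energy per unit of useful output for an inference workload."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    kwh = energy_joules(samples) / 3.6e6  # 1 kWh = 3.6e6 J
    return kwh / (tokens / 1e6)


# Hypothetical trace: a steady 400 W draw for one hour, 2 million tokens served.
trace = [(0.0, 400.0), (1800.0, 400.0), (3600.0, 400.0)]
print(kwh_per_million_tokens(trace, tokens=2_000_000))
```

The trapezoidal rule is a simple choice for irregularly spaced samples. A production pipeline would also need to clip the trace to the workload's start and end times.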

The shift from facility metrics to workload metrics is not a refinement. It is a change of subject. It also changes who is accountable. A facility PUE belongs to the operator. A per-workload efficiency belongs to whoever wrote the code, chose the model, and scheduled the run. That is uncomfortable. It is also where real engineering progress becomes possible.

Why averages lie

A 30-day rolling average of datacenter energy consumption is a flat line. The reality underneath is anything but. AI workloads are bursty, heterogeneous, and shape-dependent. Training jobs spike then idle. Inference clusters oscillate with diurnal request patterns. Fine-tuning runs vary by orders of magnitude depending on batch size, sequence length, and gradient accumulation strategy. The same model, deployed on the same hardware, can vary in joules-per-token by 4-7× between best-case and worst-case configurations.

Inter-workload variance: 4-7×. kWh per million tokens, same model, same GPU, different configs.
Intra-workload variance: 2.2×. Energy drift across a single training run as data distribution shifts.
Idle tax: 12-28%. Energy consumed by GPUs allocated but not productively utilised.

Averages are not analytically neutral. They actively obscure the signal an operator most needs: which workloads are well implemented and which are not. On the facility bill, the joules consumed by a well-implemented workload and a broken one are indistinguishable, yet only the broken one can be fixed. Without per-workload measurement, you cannot tell them apart.

The measurement stack

Measuring energy at the workload boundary is a systems problem. It requires simultaneous capture at multiple layers, time-aligned to the workload lifecycle, and reconciled into a single attestation. We currently identify four capture surfaces, each with different fidelity and trust properties:

NVML — GPU board-level

NVIDIA Management Library (NVML) exposes per-GPU power draw, reported in milliwatts, at sub-second resolution; the update interval and averaging window depend on the GPU generation. It is the highest-fidelity reading available without external instrumentation, but it measures only the GPU board, not CPU, RAM, networking, or cooling overhead. For dense AI workloads (where the GPU dominates energy), NVML alone captures 70-85% of the signal.

IPMI / Redfish — node-level

Baseboard management controllers report whole-node power draw, including CPU, memory, and on-board I/O. Resolution is typically 1-5 seconds. This catches the energy NVML misses for the same physical machine. Reconciling NVML and IPMI lets us separate compute energy from node overhead — itself a meaningful efficiency metric.

PDU — rack-level

Smart power distribution units measure energy at the rack inlet. This captures everything the node does not — network switches, fans, top-of-rack overhead — and serves as the ground-truth check against which NVML+IPMI readings can be validated. Discrepancies between the sum of node readings and PDU readings are themselves a diagnostic signal.
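The reconciliation across the three inner layers can be sketched as a simple subtraction at each boundary. The example below assumes all readings are average power over the same time-aligned window. The function name, dictionary layout, and sample values are hypothetical.

```python
from typing import Dict, List


def reconcile(gpu_watts: Dict[str, List[float]],
              node_watts: Dict[str, float],
              rack_watts: float) -> dict:
    """Decompose one time-aligned rack reading into three layers of power."""
    # Node overhead: whole-node reading (IPMI/Redfish) minus its GPUs (NVML).
    node_overhead = {
        node: total - sum(gpu_watts.get(node, []))
        for node, total in node_watts.items()
    }
    # Rack overhead: rack inlet reading (PDU) minus the sum of node readings.
    rack_overhead = rack_watts - sum(node_watts.values())
    # A negative residual means an inner sensor reported more than the layer
    # enclosing it, which points at clock skew or a miscalibrated sensor.
    suspect = [node for node, watts in node_overhead.items() if watts < 0]
    if rack_overhead < 0:
        suspect.append("rack")
    return {
        "node_overhead_w": node_overhead,
        "rack_overhead_w": rack_overhead,
        "suspect": suspect,
    }


result = reconcile(
    gpu_watts={"node-a": [350.0, 360.0], "node-b": [300.0, 310.0]},
    node_watts={"node-a": 900.0, "node-b": 800.0},
    rack_watts=1850.0,
)
print(result)
```

Treating a negative residual as the only anomaly is a deliberately minimal rule. A real deployment would likely also track how each residual drifts over time.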

Facility — building-level

The aggregate we started with. Useful as the outermost envelope and for cross-checking PUE, but not the subject of measurement. Facility totals are the context for per-workload readings, not a substitute for them.

The technical challenge is not access to any one of these sensors — they have existed for years. The challenge is capturing them all, synchronously, attributed to the workload that caused them, with a tamper-evident chain of custody. That is an attestation problem, not a metering problem.

From measurement to proof

A measurement that cannot be independently verified is a claim, not evidence. The AI industry has been content to treat self-reported sustainability numbers as evidence because there has been no alternative. We propose there should be one.

The path from raw sensor reading to verifiable claim has four steps:

  1. Capture. Sensor values are read at the workload boundary, time-stamped from a monotonic clock, and annotated with the workload identifier, model hash, and runtime context.
  2. Sign. Each capture is signed with a hardware-rooted Ed25519 key whose private half never leaves the measurement environment. The signature binds the reading to the machine that produced it.
  3. Bundle. Signed captures are aggregated into a Merkle tree per workload run, producing a single root hash that summarises the entire energy history of the computation.
  4. Anchor. The Merkle root is published to a public ledger (in our production system, Polygon mainnet), giving the entire workload a permanent, third-party-witnessed timestamp.

The result is a certificate: a portable artefact that anyone, anywhere, forever, can verify against the public anchor without trusting the issuer. The operator cannot revise the number after the fact. The customer cannot dispute that the reading came from the hardware. The regulator does not need to take anyone's word.
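The bundle and verify steps can be sketched with the standard library alone. This is a minimal illustration, not the production design: signing (step 2) and ledger anchoring (step 4) are omitted, the capture fields are hypothetical, the leaf and node prefixes follow the domain-separation style of RFC 6962, and promoting an unpaired node unchanged is a choice made for this sketch.

```python
import hashlib
import json
from typing import List, Tuple


def leaf_hash(capture: dict) -> bytes:
    # Canonical JSON so the same capture always hashes to the same bytes.
    payload = json.dumps(capture, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + payload).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return every level of the tree, from the leaves up to the single root."""
    if not leaves:
        raise ValueError("at least one capture is required")
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur) - 1, 2)]
        if len(cur) % 2:  # an unpaired node is promoted unchanged
            nxt.append(cur[-1])
        levels.append(nxt)
    return levels


def inclusion_proof(levels: List[List[bytes]], index: int) -> List[Tuple[str, bytes]]:
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append(("L" if sibling < index else "R", level[sibling]))
        index //= 2
    return proof


def verify(capture: dict, proof: List[Tuple[str, bytes]], root: bytes) -> bool:
    h = leaf_hash(capture)
    for side, sibling in proof:
        h = node_hash(sibling, h) if side == "L" else node_hash(h, sibling)
    return h == root


# Hypothetical captures for one workload run.
captures = [{"workload_id": "run-001", "t_ns": i * 1_000_000_000, "watts": 400.0 + i}
            for i in range(5)]
levels = build_levels([leaf_hash(c) for c in captures])
root = levels[-1][0]
proof = inclusion_proof(levels, 3)
print(root.hex(), verify(captures[3], proof, root))
```

A verifier holding only the anchored root, one capture, and its proof can confirm that the capture belongs to the run without seeing the other readings.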

"Verifiability is the difference between a sustainability claim and an engineering fact. The technology to cross that line already exists."
(§ 05, From measurement to proof)

Per-workload primitives

Once per-workload measurement is real, a whole class of derived metrics becomes possible that simply could not exist under aggregate reporting. Each of these is a different slice of the same underlying signal:

| Primitive | Definition | Signal it carries |
| --- | --- | --- |
| η · joules / token | Total signed energy ÷ tokens produced | Inference efficiency at workload boundary |
| η · joules / epoch | Training energy ÷ epoch count at target loss | Training implementation quality |
| utilisation curve | GPU SM occupancy time-series across the run | Code-side bottlenecks, idle taxes |
| thermal envelope | Joules per °C above ambient, normalised | Cooling-design efficiency for this workload |
| memory pressure ratio | VRAM-bound watt-seconds ÷ compute-bound watt-seconds | Whether the workload is bandwidth- or FLOP-limited |
| idle tax | Energy consumed while GPU SM occupancy < 5% | Allocation waste: scheduling or pipelining issue |

None of these can be computed from a facility-level number. All of them can be computed from a stream of signed per-workload captures. They are not exotic — they are the observability primitives every serious AI operator already wants. The barrier has been the absence of an evidence layer that makes them portable, comparable, and provable.
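As a rough illustration of how two of these primitives fall out of a capture stream, the sketch below computes joules per token and the idle tax. The `Capture` record, its field names, and the rule of classifying each interval by the occupancy at its start are assumptions made for this example. Signature checking is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Capture:
    t: float             # seconds, from a monotonic clock
    watts: float         # power at the workload boundary
    sm_occupancy: float  # GPU SM occupancy, 0.0 to 1.0


def primitives(captures: Sequence[Capture], tokens: int,
               idle_threshold: float = 0.05) -> dict:
    """Derive joules/token and the idle-tax share from one run's captures."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    total = idle = 0.0
    for a, b in zip(captures, captures[1:]):
        joules = 0.5 * (a.watts + b.watts) * (b.t - a.t)  # trapezoidal rule
        total += joules
        # Assumed rule: an interval counts as idle if occupancy at its
        # start is below the threshold.
        if a.sm_occupancy < idle_threshold:
            idle += joules
    return {
        "joules_per_token": total / tokens,
        "idle_tax": idle / total if total else 0.0,
    }


# Hypothetical run: two seconds allocated but idle, then two seconds of work.
run = [
    Capture(0.0, 100.0, 0.0),
    Capture(1.0, 100.0, 0.0),
    Capture(2.0, 100.0, 0.9),
    Capture(3.0, 300.0, 0.9),
    Capture(4.0, 300.0, 0.9),
]
print(primitives(run, tokens=350))
```

The 5% default mirrors the idle-tax definition in the table. The remaining primitives would need additional sensor fields, such as temperature or memory counters, in each capture.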

Operational economics

Per-workload efficiency has direct consequences for compute economics — and these are where the abstract argument becomes a budget line.

Procurement

Buyers can specify and verify efficiency targets in contracts. A foundation model buyer can require ≤ X kWh per million tokens as a deliverable, with the signed certificate as proof of compliance. Today this is impossible because there is no honest unit to contract against.

Scheduling

Orchestrators can route workloads not only by latency and cost, but by attested efficiency. A model with a tight thermal envelope can be preferred for racks where cooling headroom is limited. A bandwidth-heavy workload can be routed away from nodes where memory is shared.

Capacity planning

Aggregate forecasts under-fit reality. Workload-aware capacity models — built from historical per-workload signatures — produce better predictions of energy demand, peak load, and cooling requirements. The forecast becomes a function of the workload mix, not just the headcount of GPUs.

Regulatory reporting

Frameworks like CSRD, the EU AI Act, and GHG Protocol Scope 3 reporting are converging on per-workload accountability. Self-reported aggregates will not survive serious audit. Signed per-workload certificates will.

The infrastructure feedback loop

The deepest consequence of per-workload measurement is not the metric itself. It is what becomes knowable about the physical substrate once you have enough of them.

A datacenter operator today designs cooling, density, power topology, and networking from manufacturer spec sheets and prior-generation experience. They build the building, then observe what the building does. The observation comes after the commitment.

With years of signed per-workload telemetry, the order can invert. Cooling can be designed against measured thermal envelopes of the workloads that will actually run. Density can be planned against measured utilisation distributions, not nameplate TDPs. Power topology can be sized against measured peak-to-mean ratios for the actual workload mix. The building is designed from the physics of the computation, not the marketing of the silicon.

"Most operators build first and learn after. The operators that learn first and build after will be years ahead architecturally."
(§ 08, The feedback loop)

This is the inversion that matters. Workload-level efficiency measurement is not just a reporting improvement. It is the data substrate for physics-aware compute infrastructure — the next generation of facility design, where the workload informs the building rather than the building constraining the workload.

Open research directions

The shift to per-workload measurement opens questions that the aggregate framing simply could not formulate. We list the ones we consider most consequential, and most tractable in the next 24-36 months:

Conclusion

The energy consumed by AI is not a building's problem. It is a workload's property. Until measurement reflects that, the industry will continue to report numbers that flatter the operator and inform no one. The technical components required to make the shift — sensor stacks, hardware-rooted signing, Merkle anchoring, transparent public verification — are not speculative. They exist in production today.

What remains is to make them the default: to expect, contract for, and verify per-workload efficiency in the same way the industry already expects, contracts for, and verifies cryptographic signatures, code provenance, and supply-chain integrity. The energy footprint of computation deserves the same epistemic standard.

Serial Alice exists to build that default. This paper is the first formal articulation of the thesis that underwrites the rest of our research programme — and the architecture of the infrastructure we intend to inform with it.

Per-workload efficiency is the new unit of compute.
