The frame problem
When a hyperscaler announces that its datacenter operates at a PUE of 1.10, what has actually been said? That for every unit of energy delivered to compute, 0.10 units were spent on cooling, lighting, power conversion, and overhead. It is a real number, and a useful one — for facility engineering. It tells the world almost nothing about whether the compute that consumed those joules was well-spent.
A PUE-optimised facility running a poorly-implemented attention kernel at 40% GPU utilisation is, in the only sense that should matter, less efficient than a PUE-1.4 facility running the same workload at 92% utilisation. The first looks excellent on a sustainability report. The second produced more useful computation per joule consumed. The reporting frame chose the wrong subject. The building is not the workload.
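The comparison above can be made concrete with a rough calculation. The sketch below rests on a deliberate simplification that the text does not state: a GPU is assumed to draw roughly the same power regardless of utilisation, so useful work per facility joule scales as utilisation divided by PUE. The function name is hypothetical, and real hardware draws less power when idle, so the true gap would be smaller than this.

```python
def useful_work_per_facility_joule(utilisation: float, pue: float) -> float:
    """Relative useful work per joule drawn at the facility meter.

    Assumption (illustrative only): IT power is roughly constant regardless
    of utilisation, so useful work per IT joule is proportional to
    utilisation, and each IT joule costs `pue` joules at the facility meter.
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return utilisation / pue


# The two facilities from the paragraph above.
facility_a = useful_work_per_facility_joule(utilisation=0.40, pue=1.10)
facility_b = useful_work_per_facility_joule(utilisation=0.92, pue=1.40)

print(f"PUE 1.10 at 40% utilisation: {facility_a:.3f}")
print(f"PUE 1.40 at 92% utilisation: {facility_b:.3f}")
print(f"Advantage of the higher-PUE facility: {facility_b / facility_a:.2f}x")
```

Under this simplification the facility with the worse PUE delivers roughly 1.8 times as much useful work per facility joule.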
Worse: PUE is reported as an annualised average, smoothed across all workloads, all seasons, and all operating regimes. Two facilities with identical PUEs may be doing fundamentally different work. Two workloads in the same facility may have wildly divergent energy footprints. The aggregate hides everything that operators, regulators, and customers should be able to see.
"You cannot manage what you do not measure, and you cannot trust what you cannot verify."
Operational premise, Serial Alice Research
The unit of measurement
If the building is the wrong subject, what is the right one? We argue the only honest unit is energy per unit of useful work delivered. The definition of "useful work" varies by workload, and that is acceptable, even necessary. For training, the unit may be kWh per epoch at a target loss. For inference, kWh per million tokens or kWh per accepted completion. For batch jobs, kWh per record.
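As a minimal sketch of how such units could be computed, the helpers below convert a measured energy total in joules into kWh per million tokens and kWh per epoch. The function names and the example figures are illustrative assumptions, not values from any measured system.

```python
JOULES_PER_KWH = 3_600_000.0  # 1 kWh = 3.6 MJ


def kwh_per_million_tokens(energy_joules: float, tokens: int) -> float:
    """Inference unit: energy in kWh divided by millions of tokens produced."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return (energy_joules / JOULES_PER_KWH) / (tokens / 1_000_000)


def kwh_per_epoch(energy_joules: float, epochs_to_target_loss: int) -> float:
    """Training unit: energy in kWh divided by epochs needed to reach the target loss."""
    if epochs_to_target_loss <= 0:
        raise ValueError("epoch count must be positive")
    return (energy_joules / JOULES_PER_KWH) / epochs_to_target_loss


# Illustrative numbers only: 7.2 MJ spent producing 4 million tokens.
print(kwh_per_million_tokens(7_200_000.0, 4_000_000))  # 0.5
```

The same pattern extends to kWh per accepted completion or kWh per record by swapping the denominator.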
The shift from facility metrics to workload metrics is not a refinement. It is a change of subject. It also changes who is accountable. A facility PUE belongs to the operator. A per-workload efficiency belongs to whoever wrote the code, whoever chose the model, and whoever scheduled the run. That is uncomfortable. It is also where real engineering progress becomes possible.
Why averages lie
A 30-day rolling average of datacenter energy consumption is a flat line. The reality underneath is anything but. AI workloads are bursty, heterogeneous, and shape-dependent. Training jobs spike then idle. Inference clusters oscillate with diurnal request patterns. Fine-tuning runs vary by orders of magnitude depending on batch size, sequence length, and gradient accumulation strategy. The same model, deployed on the same hardware, can vary in joules-per-token by 4-7× between best-case and worst-case configurations.
Averages are not analytically neutral. They actively obscure the signal an operator most needs: which workloads are well-implemented and which are not. On the facility bill, the joules consumed by a well-implemented workload and by a broken one are indistinguishable, yet only the broken one can be fixed. Without per-workload measurement, you cannot tell them apart.
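A small illustration of the point, using made-up numbers: two runs that contribute the same energy to the facility total but differ fivefold in joules per token. The class name, run names, and figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadRun:
    name: str
    energy_joules: float
    tokens: int

    @property
    def joules_per_token(self) -> float:
        return self.energy_joules / self.tokens


# Illustrative figures: both runs draw the same energy from the facility.
runs = [
    WorkloadRun("tuned-kernel", energy_joules=1_000_000.0, tokens=2_000_000),
    WorkloadRun("untuned-kernel", energy_joules=1_000_000.0, tokens=400_000),
]

# What the aggregate view sees: one total and one blended ratio.
facility_joules = sum(r.energy_joules for r in runs)
blended = facility_joules / sum(r.tokens for r in runs)
print(f"facility total: {facility_joules:.0f} J, blended: {blended:.3f} J/token")

# What the per-workload view sees: which run is worth fixing.
for r in runs:
    print(f"{r.name}: {r.joules_per_token:.2f} J/token")
```

The blended figure of about 0.83 J/token describes neither run; only the per-run breakdown shows that one of them uses five times the energy per token of the other.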
The measurement stack
Measuring energy at the workload boundary is a systems problem. It requires simultaneous capture at multiple layers, time-aligned to the workload lifecycle, and reconciled into a single attestation. We currently identify four capture surfaces, each with different fidelity and trust properties:
NVML — GPU board-level
NVIDIA Management Library (NVML) exposes per-GPU board power draw, reported in milliwatts, and can be polled at sub-second intervals; the effective update rate and any averaging window depend on the GPU architecture and driver. It is the highest-fidelity reading available without external instrumentation, but it measures only the GPU board, not CPU, RAM, networking, or cooling overhead. For dense AI workloads (where the GPU dominates energy), NVML alone captures 70-85% of the signal.
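Turning power samples into an energy figure means integrating power over time. The sketch below is one way to do that, assuming a caller-supplied `read_power_w` function; on a machine with an NVIDIA GPU that function could wrap `pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0`, which returns milliwatts, but the sampler here is kept hardware-independent so it runs anywhere. The function names, the polling loop, and the trapezoidal rule are illustrative choices, not the authors' implementation.

```python
import time
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (seconds since start, watts)


def sample_power(
    read_power_w: Callable[[], float],
    duration_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Poll a power reading against a monotonic clock until duration_s has elapsed."""
    samples: List[Sample] = []
    start = clock()
    while True:
        elapsed = clock() - start
        samples.append((elapsed, read_power_w()))
        if elapsed >= duration_s:
            break
        sleep(interval_s)
    return samples


def integrate_energy_joules(samples: List[Sample]) -> float:
    """Trapezoidal integration of (time, watts) samples into joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


# Hand-written trace: 100 W for one second, then a ramp to 200 W.
trace = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]
print(integrate_energy_joules(trace))  # 250.0
```

The clock and sleep functions are injectable so the sampler can be exercised deterministically without real hardware or real waiting.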
IPMI / Redfish — node-level
Baseboard management controllers report whole-node power draw, including CPU, memory, and on-board I/O. Resolution is typically 1-5 seconds. This catches the energy NVML misses for the same physical machine. Reconciling NVML and IPMI lets us separate compute energy from node overhead — itself a meaningful efficiency metric.
PDU — rack-level
Smart power distribution units measure energy at the rack inlet. This captures everything the node does not — network switches, fans, top-of-rack overhead — and serves as the ground-truth check against which NVML+IPMI readings can be validated. Discrepancies between the sum of node readings and PDU readings are themselves a diagnostic signal.
Facility — building-level
The aggregate we started with. Useful as the outermost envelope and for cross-checking PUE, but not the subject of measurement. Facility totals are the context for per-workload readings, not a substitute for them.
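The reconciliation between layers can be sketched as simple subtraction: node energy minus GPU energy gives node overhead, and rack energy minus the sum of node energies gives rack overhead. In the sketch below the class and function names are hypothetical, and the 15% tolerance is an arbitrary placeholder rather than a recommended threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NodeReading:
    gpu_joules: float   # summed GPU board energy for the node (NVML layer)
    node_joules: float  # whole-node energy from the BMC (IPMI / Redfish layer)

    @property
    def overhead_joules(self) -> float:
        """Energy the node used outside the GPU boards (CPU, memory, I/O)."""
        return self.node_joules - self.gpu_joules


@dataclass(frozen=True)
class RackReconciliation:
    node_total_joules: float
    rack_overhead_joules: float
    rack_overhead_fraction: float
    suspicious: bool


def reconcile_rack(
    nodes: Sequence[NodeReading],
    pdu_joules: float,
    max_overhead_fraction: float = 0.15,  # assumed tolerance, illustrative only
) -> RackReconciliation:
    """Compare the sum of node readings against the rack-inlet (PDU) reading."""
    if pdu_joules <= 0:
        raise ValueError("pdu_joules must be positive")
    node_total = sum(n.node_joules for n in nodes)
    overhead = pdu_joules - node_total
    fraction = overhead / pdu_joules
    # Nodes cannot consume more than the rack inlet delivered, and an unusually
    # large gap suggests a missing node, a clock skew, or a faulty sensor.
    suspicious = overhead < 0 or fraction > max_overhead_fraction
    return RackReconciliation(node_total, overhead, fraction, suspicious)


nodes = [NodeReading(800.0, 1000.0), NodeReading(750.0, 1000.0)]
result = reconcile_rack(nodes, pdu_joules=2200.0)
print(result)
```

All readings must cover the same time window for the subtraction to mean anything, which is why time alignment across layers matters as much as the sensors themselves.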
The technical challenge is not access to any one of these sensors — they have existed for years. The challenge is capturing them all, synchronously, attributed to the workload that caused them, with a tamper-evident chain of custody. That is an attestation problem, not a metering problem.
From measurement to proof
A measurement that cannot be independently verified is a claim, not evidence. The AI industry has been content to treat self-reported sustainability numbers as evidence because there has been no alternative. We propose there should be one.
The path from raw sensor reading to verifiable claim has four steps:
- Capture. Sensor values are read at the workload boundary, time-stamped from a monotonic clock, and annotated with the workload identifier, model hash, and runtime context.
- Sign. Each capture is signed with a hardware-rooted Ed25519 key whose private half never leaves the measurement environment. The signature binds the reading to the machine that produced it.
- Bundle. Signed captures are aggregated into a Merkle tree per workload run, producing a single root hash that summarises the entire energy history of the computation.
- Anchor. The Merkle root is published to a public ledger (in our production system, Polygon mainnet), giving the entire workload a permanent, third-party-witnessed timestamp.
The result is a certificate: a portable artefact that anyone, anywhere, forever, can verify against the public anchor without trusting the issuer. The operator cannot revise the number after the fact. The customer cannot dispute that the reading came from the hardware. The regulator does not need to take anyone's word.
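The Bundle step and the third-party check it enables can be sketched with the standard library alone. The code below builds a SHA-256 Merkle tree over serialised captures, produces an inclusion proof for one capture, and verifies it against the root. It makes several assumptions: the Sign step is omitted because Ed25519 is not in the Python standard library, the leaf and node prefixes follow the domain-separation convention of RFC 6962, and duplicating the last node on odd levels is a simplification that a production scheme would handle differently. All function names are hypothetical.

```python
import hashlib
from typing import List, Tuple

ProofStep = Tuple[bytes, bool]  # (sibling hash, sibling_is_on_the_right)


def _leaf(capture: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + capture).digest()


def _node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(captures: List[bytes]) -> List[List[bytes]]:
    """Return every level of the tree, leaves first, root level last."""
    if not captures:
        raise ValueError("at least one capture is required")
    level = [_leaf(c) for c in captures]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # simplification: pair the odd node with itself
        level = [_node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels: List[List[bytes]], index: int) -> List[ProofStep]:
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    proof: List[ProofStep] = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # odd node was paired with itself
        proof.append((level[sibling], index % 2 == 0))
        index //= 2
    return proof


def verify(capture: bytes, proof: List[ProofStep], root: bytes) -> bool:
    """Recompute the root from a capture and its proof; no trust in the issuer needed."""
    digest = _leaf(capture)
    for sibling, sibling_is_right in proof:
        digest = _node(digest, sibling) if sibling_is_right else _node(sibling, digest)
    return digest == root


captures = [b"t=0;p=310W", b"t=1;p=305W", b"t=2;p=120W"]  # placeholder payloads
levels = build_levels(captures)
root = levels[-1][0]  # this 32-byte value is what would be anchored publicly
proof = inclusion_proof(levels, 2)
print(verify(captures[2], proof, root))        # True
print(verify(b"t=2;p=090W", proof, root))      # False: a revised reading fails
```

Because the root is fixed once published, changing any single capture after the fact changes the recomputed root and the proof no longer verifies.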
"Verifiability is the difference between a sustainability claim and an engineering fact. The technology to cross that line already exists."
§ 05, From measurement to proof
Per-workload primitives
Once per-workload measurement is real, a whole class of derived metrics becomes possible that simply could not exist under aggregate reporting. Each of these is a different slice of the same underlying signal:
| Primitive | Definition | Signal it carries |
|---|---|---|
| η · joules / token | Total signed energy ÷ tokens produced | Inference efficiency at workload boundary |
| η · joules / epoch | Training energy ÷ epoch count at target loss | Training implementation quality |
| utilisation curve | GPU SM occupancy time-series across the run | Code-side bottlenecks, idle taxes |
| thermal envelope | Joules per °C above ambient, normalised | Cooling-design efficiency for this workload |
| memory pressure ratio | VRAM-bound watt-seconds ÷ compute-bound watt-seconds | Whether the workload is bandwidth- or FLOP-limited |
| idle tax | Energy consumed while GPU SM occupancy < 5% | Allocation waste — scheduling or pipelining issue |
None of these can be computed from a facility-level number. All of them can be computed from a stream of signed per-workload captures. They are not exotic — they are the observability primitives every serious AI operator already wants. The barrier has been the absence of an evidence layer that makes them portable, comparable, and provable.
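As a sketch of how two of these primitives could fall out of a capture stream, the code below computes total energy, idle tax, and joules per token from a short list of captures. The `Capture` fields, the left-rectangle integration (each reading is assumed to hold until the next one), and the sample values are illustrative assumptions; signature checking is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Capture:
    t_s: float           # timestamp in seconds from a monotonic clock
    power_w: float       # power reading at that instant
    sm_occupancy: float  # GPU SM occupancy in the range 0..1


def total_joules(captures: Sequence[Capture]) -> float:
    """Energy over the run, holding each reading until the next capture."""
    return sum(
        prev.power_w * (cur.t_s - prev.t_s)
        for prev, cur in zip(captures, captures[1:])
    )


def idle_tax_joules(captures: Sequence[Capture], idle_threshold: float = 0.05) -> float:
    """Energy spent in intervals that began with SM occupancy below the threshold."""
    return sum(
        prev.power_w * (cur.t_s - prev.t_s)
        for prev, cur in zip(captures, captures[1:])
        if prev.sm_occupancy < idle_threshold
    )


# Illustrative run: two busy seconds, two near-idle seconds, then the end marker.
run = [
    Capture(0.0, 300.0, 0.90),
    Capture(1.0, 300.0, 0.90),
    Capture(2.0, 80.0, 0.01),
    Capture(3.0, 80.0, 0.02),
    Capture(4.0, 300.0, 0.90),
]
tokens_produced = 380  # assumed token count for the run

energy = total_joules(run)
idle = idle_tax_joules(run)
print(f"joules/token: {energy / tokens_produced:.2f}")
print(f"idle tax: {idle:.0f} J ({idle / energy:.1%} of the run)")
```

The 5% threshold matches the idle tax definition in the table; the other primitives would need additional fields, such as temperature or memory counters, in each capture.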
Operational economics
Per-workload efficiency has direct consequences for compute economics — and these are where the abstract argument becomes a budget line.
Procurement
Buyers can specify and verify efficiency targets in contracts. A foundation model buyer can require ≤ X kWh per million tokens as a deliverable, with the signed certificate as proof of compliance. Today this is impossible because there is no honest unit to contract against.
Scheduling
Orchestrators can route workloads not only by latency and cost, but by attested efficiency. A model with a tight thermal envelope can be preferred for racks where cooling headroom is limited. A bandwidth-heavy workload can be routed away from nodes where memory is shared.
Capacity planning
Aggregate forecasts under-fit reality. Workload-aware capacity models — built from historical per-workload signatures — produce better predictions of energy demand, peak load, and cooling requirements. The forecast becomes a function of the workload mix, not just the headcount of GPUs.
Regulatory reporting
Frameworks such as CSRD, the EU AI Act, and GHG Protocol Scope 3 reporting are, in our reading, moving towards per-workload accountability. Self-reported aggregates will not survive serious audit; signed per-workload certificates will.
The infrastructure feedback loop
The deepest consequence of per-workload measurement is not the metric itself. It is what becomes knowable about the physical substrate once you have enough of them.
A datacenter operator today designs cooling, density, power topology, and networking from manufacturer spec sheets and prior-generation experience. They build the building, then observe what the building does. The observation comes after the commitment.
With years of signed per-workload telemetry, the order can invert. Cooling can be designed against measured thermal envelopes of the workloads that will actually run. Density can be planned against measured utilisation distributions, not nameplate TDPs. Power topology can be sized against measured peak-to-mean ratios for the actual workload mix. The building is designed from the physics of the computation, not the marketing of the silicon.
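One of the quantities named above, the peak-to-mean ratio, is straightforward to derive from a measured power trace. The sketch below assumes evenly spaced samples and uses made-up values; the function name is hypothetical.

```python
from typing import Sequence


def peak_to_mean_ratio(power_w: Sequence[float]) -> float:
    """Ratio of peak power to mean power for an evenly sampled trace."""
    if not power_w:
        raise ValueError("trace must not be empty")
    mean = sum(power_w) / len(power_w)
    if mean <= 0:
        raise ValueError("mean power must be positive")
    return max(power_w) / mean


# Illustrative traces with the same peak but different shapes.
steady = [700.0, 720.0, 800.0, 780.0]
bursty = [200.0, 200.0, 800.0, 200.0]

print(f"steady workload: {peak_to_mean_ratio(steady):.2f}")
print(f"bursty workload: {peak_to_mean_ratio(bursty):.2f}")
```

Both traces peak at 800 W, so a design sized on peak alone treats them identically, while the ratio shows that the bursty workload leaves far more provisioned capacity unused on average.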
"Most operators build first and learn after. The operators that learn first and build after will be years ahead architecturally."
§ 08, The feedback loop
This is the inversion that matters. Workload-level efficiency measurement is not just a reporting improvement. It is the data substrate for physics-aware compute infrastructure — the next generation of facility design, where the workload informs the building rather than the building constraining the workload.
Open research directions
The shift to per-workload measurement opens questions that the aggregate framing simply could not formulate. We list the ones we consider most consequential, and most tractable in the next 24-36 months:
- Workload DNA. Can a workload's energy signature be compressed into a short fingerprint that predicts efficiency on unseen hardware? Early evidence suggests yes for narrow classes of workloads; the open question is generalisation.
- Adaptive runtime scheduling. Given live signed telemetry, how fast can a scheduler adapt placement decisions to drift in workload behaviour without violating the cryptographic chain of custody?
- Verifiable efficiency benchmarks. Standardised, attestable benchmark suites that report energy as a first-class output, comparable across vendors and substitutable into procurement contracts.
- Cross-vendor attestation. NVML is a vendor library. The proof story should not be vendor-locked. What is the minimal hardware root of trust that lets AMD, Intel, and emerging accelerators participate on equal terms?
- Tiny efficient models. The smallest model that solves a given task at a given quality is, by definition, the most energy-efficient one. How much can be quantised, distilled, and pruned away before quality degrades, and how do we prove it?
Conclusion
The energy consumed by AI is not a building's problem. It is a workload's property. Until measurement reflects that, the industry will continue to report numbers that flatter the operator and inform no one. The technical components required to make the shift — sensor stacks, hardware-rooted signing, Merkle anchoring, transparent public verification — are not speculative. They exist in production today.
What remains is to make them the default: to expect, contract for, and verify per-workload efficiency in the same way the industry already expects, contracts for, and verifies cryptographic signatures, code provenance, and supply-chain integrity. The energy footprint of computation deserves the same epistemic standard.
Serial Alice exists to build that default. This paper is the first formal articulation of the thesis that underwrites the rest of our research programme — and the architecture of the infrastructure we intend to inform with it.