Predicting transformer failure 19 days early

§1

The problem

// the cost nobody puts on the board pack

A medium-voltage transformer failure costs £180,000–£420,000 all-in, and most of that cost is reactive. The asset itself is between £40k and £120k depending on rating; the rest is the emergency outage, the temporary supply arrangements, the crew time at antisocial hours, the transport and craning costs for an unplanned replacement, and — for any DNO with a meaningful regulatory price-control regime — the customer-minutes-lost penalty that flows through into the next price-control settlement. Catch the developing fault 19 days early and most of that cost vanishes: you plan the outage at a normal-rate window, you align it with adjacent maintenance, and the customer-minutes-lost figure stays out of the regulator-facing report.

The reason this problem persists is not technical. It is that the failure rate per transformer per year is low — typically 0.4–0.9% for medium-voltage assets, lower for the well-maintained subset — and the false-positive cost of a "predicted" failure that does not materialise is high enough to make conservative operators dismiss the whole class of model. We took that seriously when we built FATHOM Sentinel: the 0.31 false-positives-per-asset-per-year figure we publish is more important to operators than the 19-day median lead time, and it is the figure we have spent the most engineering hours getting right.
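To put the low failure rate in proportion, the ranges quoted above can be multiplied into an expected reactive cost per transformer per year. This is a rough sketch only: the function name is ours, and it assumes failure rate and failure cost are independent, which the text does not claim.

```python
def expected_annual_failure_cost(failure_rate: float, cost_per_failure: float) -> float:
    """Expected reactive failure cost per transformer per year, in pounds."""
    return failure_rate * cost_per_failure


# Ranges quoted in the text: 0.4-0.9% annual failure rate, £180k-£420k all-in cost.
low = expected_annual_failure_cost(0.004, 180_000)
high = expected_annual_failure_cost(0.009, 420_000)
print(f"£{low:,.0f} to £{high:,.0f} per transformer per year")
```

On those ranges the per-asset expectation is only a few hundred to a few thousand pounds a year, so the cost of chasing alarms that turn out to be nothing is not negligible by comparison.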

§2

Multi-modal inputs

// six signals, none decisive alone

FATHOM Sentinel reads six signals per monitored transformer. Dissolved-gas analysis (DGA) — typically at 15-minute cadence on a continuously-monitored asset, daily on a more conventional sampling regime — gives the chemistry view: developing arcing, partial discharge, thermal events, and paper degradation all leave DGA signatures. Partial-discharge waveforms from on-asset PD sensors give the high-frequency electrical view; the waveform morphology distinguishes corona from surface tracking from internal discharge, all of which have different predictive value. Top-oil and winding-hot-spot temperature against the IEEE C57.91 thermal model give the thermal-physics view. Load history contextualises the thermal model. Weather — ambient temperature, solar radiation, wind speed — calibrates the thermal model further. Crew-reported audible and visual cues from field maintenance enter as structured observations with timestamp and asset ID.
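A thermal residual of this kind can be sketched with the commonly cited top-oil equations from the IEEE C57.91 loading guide. This is a minimal illustration, not Sentinel's implementation: it covers top-oil only (not winding hot-spot), the ratings, loss ratio and time constant are placeholder assumptions, and expressing the residual as a percentage of the modelled rise over ambient is our choice.

```python
import math


def ultimate_top_oil_rise(load_pu: float, rated_rise_c: float,
                          loss_ratio: float, n: float = 0.8) -> float:
    """Steady-state top-oil rise over ambient (deg C) at a per-unit load.

    loss_ratio is load loss at rated load divided by no-load loss;
    n = 0.8 is the exponent usually quoted for naturally cooled units.
    """
    return rated_rise_c * ((load_pu ** 2 * loss_ratio + 1.0) / (loss_ratio + 1.0)) ** n


def top_oil_rise_at(hours: float, initial_rise_c: float,
                    ultimate_rise_c: float, tau_hours: float) -> float:
    """Exponential approach from the initial rise towards the ultimate rise."""
    return initial_rise_c + (ultimate_rise_c - initial_rise_c) * (
        1.0 - math.exp(-hours / tau_hours)
    )


def residual_pct(measured_temp_c: float, ambient_c: float, expected_rise_c: float) -> float:
    """Measured rise over ambient relative to the modelled rise, in percent."""
    measured_rise = measured_temp_c - ambient_c
    return 100.0 * (measured_rise - expected_rise_c) / expected_rise_c


# Illustrative placeholder ratings, not values from any real nameplate.
RATED_TOP_OIL_RISE_C = 55.0
LOSS_RATIO = 5.0
TAU_HOURS = 3.0

ultimate = ultimate_top_oil_rise(0.8, RATED_TOP_OIL_RISE_C, LOSS_RATIO)
expected = top_oil_rise_at(2.0, 20.0, ultimate, TAU_HOURS)
# A measured top-oil temperature whose rise is 7% above the modelled rise.
measured = 15.0 + expected * 1.07
print(round(residual_pct(measured, ambient_c=15.0, expected_rise_c=expected), 1))
```

Load history supplies `load_pu`, and weather supplies `ambient_c`, which is why those two signals are described as contextualising and calibrating the thermal view.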

Each signal is weak on its own. DGA produces enough seasonal and load-driven variation that a threshold-only approach generates one to three false alarms per asset per year; PD waveforms have enough environmental noise that pattern-matching alone over-fits; thermal residuals against the C57.91 model are sensitive to load assumptions that are themselves modelled. Together, however, the six signals are decisive: a developing winding fault produces a correlated pattern across DGA, PD, thermal residual and load history that none of them shows in isolation. The crew-reported "slight humming on TX-308" is the canonical example: on its own, it is one data point in a maintenance log; in combination with a thermal residual that is 7% above expected and a PD count that has tripled over the previous fortnight, it is strong evidence of a turn-to-turn fault.
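The idea of corroboration can be shown with a deliberately simple hand-written rule. This is not the learned model: the `Evidence` fields, the thresholds and the "three agreeing signals" cut-off are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    thermal_residual_pct: float   # percent above the thermal model's expectation
    pd_count_ratio_14d: float     # PD count now divided by PD count a fortnight ago
    crew_cue_reported: bool       # e.g. "slight humming"
    dga_trend_abnormal: bool


def corroborating_signals(e: Evidence,
                          thermal_pct_min: float = 5.0,
                          pd_ratio_min: float = 2.0) -> list[str]:
    """Names of the signals that individually look suspicious (assumed thresholds)."""
    hits = []
    if e.thermal_residual_pct >= thermal_pct_min:
        hits.append("thermal")
    if e.pd_count_ratio_14d >= pd_ratio_min:
        hits.append("pd")
    if e.crew_cue_reported:
        hits.append("crew")
    if e.dga_trend_abnormal:
        hits.append("dga")
    return hits


def needs_review(e: Evidence, min_signals: int = 3) -> bool:
    """Flag only when several independent signals agree."""
    return len(corroborating_signals(e)) >= min_signals


# The TX-308 example from the text: humming, 7% thermal residual, PD count tripled.
tx308 = Evidence(7.0, 3.0, crew_cue_reported=True, dga_trend_abnormal=False)
hum_only = Evidence(0.0, 1.0, crew_cue_reported=True, dga_trend_abnormal=False)
print(needs_review(tx308), needs_review(hum_only))
```

The humming report alone does not trip the rule; the same report alongside the thermal and PD evidence does.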

§3

The model

// graph transformer fused with 1-D conv

The core model is a graph transformer over a small grid neighbourhood (the monitored asset plus its neighbours within two electrical hops) fused with a 1-D convolutional tower for partial-discharge waveform analysis. The graph transformer captures the cross-asset structure: a developing fault on one transformer often shows up as small perturbations on its electrical neighbours before the faulty asset itself crosses any single-asset threshold. The 1-D conv tower handles the morphology of PD waveforms at native sample rate, typically 100 MSa/s on modern PD sensors, without down-sampling to the point where the discriminating shape is lost.
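Selecting the two-hop neighbourhood is a bounded breadth-first search over the network graph. A minimal sketch, assuming the grid is available as an adjacency map and that one "hop" means one direct electrical connection; every asset ID other than TX-308 is made up.

```python
from collections import deque


def electrical_neighbourhood(adjacency: dict[str, set[str]],
                             asset: str, max_hops: int = 2) -> dict[str, int]:
    """Return every asset within max_hops of `asset`, mapped to its hop distance."""
    distance = {asset: 0}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


# Hypothetical feeder: TX-305 - TX-306 - TX-307 - TX-308 - TX-309
grid = {
    "TX-308": {"TX-307", "TX-309"},
    "TX-307": {"TX-308", "TX-306"},
    "TX-309": {"TX-308"},
    "TX-306": {"TX-307", "TX-305"},
    "TX-305": {"TX-306"},
}
neighbourhood = electrical_neighbourhood(grid, "TX-308")
print(sorted(neighbourhood.items()))
```

TX-305 sits three hops away and is excluded, so the graph model for TX-308 would see four nodes in this toy layout.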

The outputs of the two branches are fused into a shared latent state, which is then decoded against the physics residual. As described on the Sentinel module page, the reconstruction loss is augmented by a physics-residual term derived from the IEEE C57.91 thermal model. A Sentinel event is emitted only when the reconstruction is good and the physics residual is bad, which is the only combination that reliably distinguishes a developing fault from a sensor problem. That single architectural choice is the reason the false-positive rate is 0.31 per asset per year rather than the 1.4–2.1 per asset per year that a pure pattern-matching approach gives on the same dataset.
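The emission rule reduces to a two-condition gate. The sketch below assumes both quantities are normalised scalars and uses placeholder thresholds; the labels given to the three non-emitting cases are our reading of the paragraph above, not documented Sentinel behaviour.

```python
def should_emit_event(reconstruction_error: float,
                      physics_residual: float,
                      max_reconstruction_error: float = 0.05,
                      min_physics_residual: float = 0.05) -> bool:
    """Emit only when the data reconstructs well AND disagrees with the physics."""
    reconstruction_good = reconstruction_error <= max_reconstruction_error
    physics_bad = physics_residual >= min_physics_residual
    return reconstruction_good and physics_bad


# The four combinations, with hypothetical values either side of the thresholds.
cases = {
    "healthy asset":          (0.01, 0.01),  # good reconstruction, physics agrees
    "developing fault":       (0.01, 0.20),  # good reconstruction, physics disagrees
    "likely sensor problem":  (0.30, 0.20),  # signals do not even reconstruct
    "noisy but unremarkable": (0.30, 0.01),
}
for label, (recon, phys) in cases.items():
    print(f"{label}: emit={should_emit_event(recon, phys)}")
```

Only the "developing fault" row emits: the signals are mutually consistent, so the sensors are probably sound, yet the asset is running away from its thermal model.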

The model was trained on a multi-DNO fault library of 12,400 historical assets, including 2,180 confirmed failures. Strict hold-outs by both calendar year and DNO prevent leakage between training and validation. The validation set was deliberately curated to include the failure modes that we knew were under-represented in the training data: bushings, OLTCs, and the rare-but-expensive category of inter-turn faults on dry-type units.
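One way to read "hold-outs by both calendar year and DNO" is a split in which validation records are later in time and from an operator the model never trained on, with mixed cases discarded. The sketch below assumes that reading; the record fields, cutoff year and DNO labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRecord:
    asset_id: str
    dno: str
    year: int


def split_holdout(records, cutoff_year: int, heldout_dnos: set[str]):
    """Train on early years from non-held-out DNOs; validate on late years from held-out DNOs."""
    train, validation = [], []
    for r in records:
        is_late = r.year >= cutoff_year
        is_heldout_dno = r.dno in heldout_dnos
        if not is_late and not is_heldout_dno:
            train.append(r)
        elif is_late and is_heldout_dno:
            validation.append(r)
        # Mixed cases (late year from a training DNO, or early year from a
        # held-out DNO) are dropped so that neither axis can leak.
    return train, validation


records = [
    AssetRecord("A1", "DNO-A", 2018),
    AssetRecord("A2", "DNO-A", 2022),
    AssetRecord("B1", "DNO-B", 2018),
    AssetRecord("B2", "DNO-B", 2022),
]
train, validation = split_holdout(records, cutoff_year=2021, heldout_dnos={"DNO-B"})
print([r.asset_id for r in train], [r.asset_id for r in validation])
```

The price of this strictness is discarded data: in the toy set above, half the records go unused.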

§4

Field calibration

// 14 months · two UK DNOs

Lab models always overpromise. We ran a 14-month live trial across two UK DNOs covering 1,842 transformers and 412 battery racks. Sentinel ran in advisory mode for the first six months and in operational mode for the remaining eight, with crew action taken on Sentinel events under a controlled escalation protocol that gave the engineering team final authority. The 19-day median lead time is the calibrated number from that trial — calibrated to the date on which the partner DNO's existing condition-monitoring tooling would have flagged the same asset, or to the date of failure if no flag was raised. The 95% confidence interval is 7–34 days; the tail on the right is dominated by slow-developing winding faults on lightly-loaded assets, where Sentinel can see the trajectory long before any threshold-based tool can.
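The lead-time definition above translates directly into a date difference with a fallback. A minimal sketch; the function name is ours and the three cases are invented for illustration, not trial data.

```python
from datetime import date
from statistics import median
from typing import Optional


def lead_time_days(sentinel_flag: date,
                   existing_tool_flag: Optional[date],
                   failure: date) -> int:
    """Days between the Sentinel flag and the reference date.

    The reference is the date the existing condition-monitoring tooling
    would have flagged the asset, or the failure date if it never did.
    """
    reference = existing_tool_flag if existing_tool_flag is not None else failure
    return (reference - sentinel_flag).days


# Invented examples: (Sentinel flag, existing-tool flag or None, failure date)
cases = [
    (date(2023, 3, 1), date(2023, 3, 13), date(2023, 4, 10)),
    (date(2023, 5, 2), None, date(2023, 5, 12)),
    (date(2023, 7, 1), date(2023, 7, 31), date(2023, 8, 15)),
]
lead_times = [lead_time_days(*c) for c in cases]
print(lead_times, median(lead_times))
```

Measuring against the incumbent tool rather than against failure makes the figure a statement of incremental warning, which is the stricter of the two comparisons.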

The 0.31 false-positives-per-asset-per-year figure is the rate at which Sentinel flagged an asset for intervention and the subsequent inspection found nothing that would have warranted intervention in the following 90 days. This is a deliberately conservative definition — many of those "false positives" found early-stage degradation that would have become a flag within twelve months — but we report the strict figure because it is the figure the asset-management team will measure us against.
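The strict definition can be written down as a classification step followed by a rate. This sketch assumes each inspection yields either no finding or a date by which intervention would have been warranted; the field names and the example counts are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def is_false_positive(flag_date: date,
                      intervention_warranted_by: Optional[date],
                      window_days: int = 90) -> bool:
    """True if inspection found nothing warranting intervention within the window."""
    if intervention_warranted_by is None:
        return True
    return intervention_warranted_by > flag_date + timedelta(days=window_days)


def false_positives_per_asset_year(false_positive_count: int,
                                   asset_count: int,
                                   observed_years: float) -> float:
    """False positives normalised by monitored asset-years."""
    if asset_count <= 0 or observed_years <= 0:
        raise ValueError("asset_count and observed_years must be positive")
    return false_positive_count / (asset_count * observed_years)


# Degradation that would only need action after eight months still counts
# as a false positive under the 90-day rule.
flag = date(2023, 1, 10)
print(is_false_positive(flag, date(2023, 9, 10)))
# Hypothetical counts: 31 false positives across 100 assets over one year.
print(false_positives_per_asset_year(31, 100, 1.0))
```

The first call shows why the strict figure overstates the nuisance rate: an inspection that finds real but slow degradation is still scored against the model.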

§5

What operators must change

// the data-discipline conversation

Stop discarding DGA samples after one read. Stop storing PD waveforms only when a fault is suspected — the developing-fault waveforms in the months before a failure are exactly the signal Sentinel needs to learn against, and they are exactly the waveforms most asset-management systems quietly discard. Keep top-oil samples at full cadence, not just the daily average. Digitise crew-reported cues into a structured form with timestamp, asset ID and free-text observation; we provide a phone-based maintenance interface for crews who would otherwise write on paper.
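A structured crew observation needs very little: the three fields named above. The sketch below is one possible shape, not the schema of the maintenance interface; the class name, field names and JSON layout are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CrewObservation:
    timestamp: datetime   # when the cue was noticed, timezone-aware
    asset_id: str
    observation: str      # free text, exactly as the crew member reported it

    def to_json(self) -> str:
        """Serialise with an ISO 8601 timestamp so records sort and join cleanly."""
        payload = asdict(self)
        payload["timestamp"] = self.timestamp.isoformat()
        return json.dumps(payload)


obs = CrewObservation(
    timestamp=datetime(2024, 1, 15, 7, 30, tzinfo=timezone.utc),
    asset_id="TX-308",
    observation="slight humming",
)
print(obs.to_json())
```

The point of the structure is the join key: a timestamp and asset ID let the free-text cue be lined up against DGA, PD and thermal data for the same asset and period.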

The data that makes Sentinel work is data that most asset-management systems quietly delete, because nobody has yet shown an operational use for it. That is the polite version. The less polite version is that the asset-management software industry has been optimised for storage cost rather than predictive value for the last twenty years, and the result is a fleet of CMMSs that throw away exactly the data that makes condition-based maintenance work. We do not blame the asset-management teams for this — they were responding to the constraints they were given — but we do say to every customer in the first conversation that the engagement starts with a data-discipline audit, not a model deployment.

The audit produces a one-page recommendation: what to start collecting, what to stop deleting, what to digitise, and what the recommended changes will cost in storage (typically £8,000–£24,000 per year per substation for the data layer, against a potential avoided cost in the high six figures for a single deferred failure). The checklist version is published in the docs section so that operators not yet ready for an engagement can run the same audit themselves.

SIGNAL · DISCUSS

Want to argue with this essay?

FATHOM engineers read every reply. If you disagree with the framing — or have data that contradicts ours — we want to hear it. The position on "stop discarding DGA samples" is the one operators most often disagree with at first and most often agree with by month three.

