Fault tolerance for frontier‑scale training

Something fails every 2.7 hours. The run keeps going.

At 16,384 GPUs a hardware fault lands every couple of hours, and today a single dead GPU in a 1,024‑GPU job can leave the other 1,023 sitting idle. Mithril recovers at the level of the individual step, so a single failure never forces the whole job to roll back.

Become a design partner See the cost →

466

interruptions in 54 days. LLaMA 3 pretraining across 16,384 GPUs.

178,000

GPU‑hours lost to failures on a single OPT‑175B run.

31.19%

of capacity spent handling failures across a reported fleet.

84.8%

daily system failure probability at 1,000 GPUs, from a 1.5% daily per‑node failure rate.

The problem

Scale turns rare faults into constant ones.

A single GPU is reliable. A thousand of them running in lockstep are not, because failure compounds with size. Training assumes every rank advances together, so the moment one drops, the entire job stalls and rolls back to its last checkpoint. Hours of compute disappear on a fleet that bills by the GPU‑hour, all from a fault that touched a fraction of a percent of the cluster.
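
As a rough illustration of what a rollback costs, the sketch below estimates the GPU‑hours lost per failure under checkpoint‑and‑rollback. It assumes faults land uniformly within a checkpoint interval, so on average half an interval of work is redone, and that every GPU waits out the restart. The function name and all three input values are hypothetical, chosen only to show the arithmetic; they are not measurements from any fleet.

```python
def rollback_gpu_hours(num_gpus: int, checkpoint_interval_h: float, restart_h: float) -> float:
    """Expected GPU-hours lost per failure when the whole job rolls back.

    ASSUMPTION: a fault lands, on average, halfway through a checkpoint
    interval, so interval / 2 hours of work are redone, and every GPU in
    the job also sits idle for the full restart.
    """
    return num_gpus * (checkpoint_interval_h / 2 + restart_h)


# Illustrative inputs only (hypothetical, not taken from any real run):
# a 1,024-GPU job, hourly checkpoints, 15 minutes to restart.
lost = rollback_gpu_hours(num_gpus=1024, checkpoint_interval_h=1.0, restart_h=0.25)
print(lost)  # 768.0
```

Under these assumed inputs, one fault on one GPU costs 768 GPU‑hours across the job. The loss scales with job size, not with the size of the fault.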

2.7 hrs

Mean time to failure during LLaMA 3 pretraining. On average, a fault lands before lunch is over.

source: Meta, industry data

100+

Hardware failures across a single OPT‑175B run, each one a restart and a hole in the schedule.

source: Meta, industry data

1 / 1,024

One failed GPU is enough to leave the remaining 1,023 idle while the job recovers.

the wedge

57×

How much likelier a 1,000‑GPU system is to fail on a given day than a single node. Rarity does not survive scale.

source: arXiv survey 2407.20018
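
The compounding behind the 84.8% and 57× figures can be sketched in a few lines. This is a minimal worked example, not the survey's own calculation: it assumes node failures are independent and that a 1,000‑GPU system is built from 8‑GPU nodes (125 nodes). The 8‑GPU node size is an assumption that does not appear in the text; the 1.5% rate and the 1,000‑GPU size do.

```python
def system_failure_probability(per_node_rate: float, num_nodes: int) -> float:
    """Probability that at least one of `num_nodes` independent nodes fails in a day."""
    return 1.0 - (1.0 - per_node_rate) ** num_nodes


PER_NODE_DAILY_RATE = 0.015  # 1.5% per-node daily failure rate (from the text)
GPUS = 1_000                 # system size (from the text)
GPUS_PER_NODE = 8            # ASSUMPTION: node size is not stated in the text
nodes = GPUS // GPUS_PER_NODE

p_system = system_failure_probability(PER_NODE_DAILY_RATE, nodes)
ratio = p_system / PER_NODE_DAILY_RATE

print(f"nodes: {nodes}")
print(f"daily system failure probability: {p_system:.4f}")
print(f"ratio to a single node: {ratio:.1f}x")
```

Under these assumptions the system‑level probability comes out at roughly 85% per day, and the ratio to a single node lands between 56 and 57, in line with the figures quoted above.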

How it works

Route around the fault. Keep the step.

Mithril runs inside the training loop as a reliability layer, watching every rank and ready to act the instant a fault lands instead of waiting for the next checkpoint.

01 Detect

Catch the fault in milliseconds

Per‑rank health instead of coarse heartbeats. Mithril sees a dead node the moment it dies, before the collective hangs the whole run.

02 Isolate

Fence it and hold its shard

The faulted rank is quarantined and its state pinned, so the other 1,023 GPUs never sit idle waiting on it.

03 Continue

Resume the step, not the job

Work re‑routes onto warm spares and continues from the current step, with no full‑job rollback. The run keeps moving forward.

Today the only options are to roll the whole job back or babysit it by hand. Mithril makes the failure invisible to the run.
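
The detect, isolate, continue flow can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Mithril's implementation: every name here (`Cluster`, `detect`, `isolate_and_replace`, `train`) is hypothetical, time is counted in whole steps rather than milliseconds, and recovering the faulted rank's state onto the spare is left out entirely. It only shows the control flow: a silent rank is flagged, its node is fenced, a warm spare takes over the rank, and the step counter never moves backwards.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    active: dict                                     # rank -> node serving that rank
    spares: list                                     # warm spare nodes
    last_seen: dict = field(default_factory=dict)    # rank -> tick of last health report
    quarantined: list = field(default_factory=list)  # fenced nodes


def detect(cluster, now, timeout=0):
    """Detect: ranks whose last health report is older than `timeout` ticks.

    A rank that has never reported counts as stale.
    """
    return sorted(r for r in cluster.active
                  if now - cluster.last_seen.get(r, -1) > timeout)


def isolate_and_replace(cluster, rank, now):
    """Isolate: fence the faulted node. Continue: hand its rank to a warm spare."""
    if not cluster.spares:
        raise RuntimeError("no warm spare left; a checkpoint restart would be needed")
    cluster.quarantined.append(cluster.active[rank])
    cluster.active[rank] = cluster.spares.pop(0)
    cluster.last_seen[rank] = now  # the spare reports health from this tick on


def train(cluster, num_steps, dies_at):
    """Simulate `num_steps` steps; `dies_at` maps a node name to the step it dies at."""
    completed = []
    for step in range(num_steps):
        # Healthy nodes report in every tick; a dead node goes silent.
        for rank, node in cluster.active.items():
            if step < dies_at.get(node, num_steps):
                cluster.last_seen[rank] = step
        # Flag stale ranks and swap in spares before the step runs.
        for rank in detect(cluster, now=step):
            isolate_and_replace(cluster, rank, now=step)
        completed.append(step)  # no rollback: the step counter only advances
    return completed


cluster = Cluster(active={r: f"node-{r}" for r in range(4)}, spares=["spare-0"])
steps = train(cluster, num_steps=6, dies_at={"node-2": 3})
print(steps)                # [0, 1, 2, 3, 4, 5]
print(cluster.quarantined)  # ['node-2']
print(cluster.active[2])    # spare-0
```

In the simulated run, `node-2` dies at step 3, is fenced in the same tick, and `spare-0` takes over rank 2 while steps 3 through 5 complete in order. A rollback design would instead reset the step counter to the last checkpoint.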

Who feels it first

Every idle GPU is on someone's invoice.

Neoclouds live and die by utilization. When a customer's run stalls on a single fault, the meter keeps running and the bill arrives all the same. Reliability here is the margin, and it is a customer‑facing advantage the larger incumbents were never built to sell.

01 Neocloud and GPU‑rental fleets, CoreWeave class
02 Capacity providers billing by the GPU‑hour
03 Labs running 10k+ GPU pretraining
04 Anyone whose run is too big to babysit by hand

The vision

We start by keeping runs alive. We end as the layer the buildout stands on.

Mithril was the strongest and lightest material in Middle‑earth: a mesh worn unseen beneath everything else, one that simply does not break. As clusters grow from thousands of GPUs into hundreds of thousands, reliability stops being a feature and becomes infrastructure. We are building the mesh beneath the world's largest training runs: invisible, load‑bearing, and unbreakable.

The clusters only get bigger. Build the layer that holds.

We are selecting a small set of neocloud partners to measure failure waste on a real fleet.

Become a design partner team@mithril.run →