Something fails every 2.7 hours. The run keeps going.
At 16,384 GPUs a hardware fault lands roughly every 2.7 hours, and today one dead node can leave the other 1,023 sitting idle. Mithril recovers the run at the level of a single step, so one failure never forces the whole job to roll back.
Scale turns rare faults into constant ones.
A single GPU is reliable. A thousand of them running in lockstep are not, because failure compounds with size. Training assumes every rank advances together, so the moment one drops, the entire job stalls and rolls back to its last checkpoint. Hours of compute disappear on a fleet that bills by the GPU‑hour, all from a fault that touched a fraction of a percent of the cluster.
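A back-of-the-envelope sketch of that compounding, in Python. It assumes faults are independent and arrive at a constant per-GPU rate, which is a simplification and not a claim from this page. The function names and the 30-minute gap since the last checkpoint are hypothetical; 16,384 GPUs and 2.7 hours are the figures quoted above.

```python
# Sketch only: how fault frequency scales with fleet size, ASSUMING
# independent faults at a constant per-GPU rate (a simplification).

FLEET_GPUS = 16_384             # fleet size quoted in the text
QUOTED_FLEET_MTBF_HOURS = 2.7   # time between faults quoted in the text


def per_gpu_mtbf_hours(fleet_gpus: int, fleet_mtbf: float) -> float:
    """Per-GPU mean time between faults implied by a fleet-level figure."""
    return fleet_gpus * fleet_mtbf


def fleet_mtbf_hours(gpus: int, gpu_mtbf: float) -> float:
    """Fleet-level mean time between faults for `gpus` independent GPUs."""
    return gpu_mtbf / gpus


def rollback_loss_gpu_hours(gpus: int, hours_since_checkpoint: float) -> float:
    """GPU-hours discarded when every rank rolls back to the last checkpoint."""
    return gpus * hours_since_checkpoint


gpu_mtbf = per_gpu_mtbf_hours(FLEET_GPUS, QUOTED_FLEET_MTBF_HOURS)
print(f"implied per-GPU MTBF: {gpu_mtbf:,.0f} h (~{gpu_mtbf / 8760:.1f} years)")

for n in (1, 1_024, 16_384, 131_072):
    print(f"{n:>7,} GPUs -> one fault every {fleet_mtbf_hours(n, gpu_mtbf):,.2f} h")

# Hypothetical scenario: the fault lands 30 minutes after the last checkpoint.
loss = rollback_loss_gpu_hours(FLEET_GPUS, 0.5)
print(f"rollback loss: {loss:,.0f} GPU-hours")
```

Under these assumptions, each GPU goes about five years between faults, yet 1,024 of them together see one roughly every 43 hours and 16,384 see one every 2.7 hours. The rollback figure scales the same way: the work lost is the time since the last checkpoint multiplied by every GPU in the job, not just the one that failed.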
Route around the fault. Keep the step.
Mithril runs inside the training loop as a reliability layer, watching every rank and ready to act the instant a fault lands instead of waiting for the next checkpoint.
Today the only options are to roll the whole job back or babysit it by hand. Mithril makes the failure invisible to the run.
Every idle GPU is on someone's invoice.
Neoclouds live and die by utilization. When a customer's run stalls on a single fault, the meter keeps running and the bill arrives all the same. Reliability here is the margin, and it is a customer‑facing advantage the larger incumbents were never built to sell.
- 01 Neocloud and GPU‑rental fleets, CoreWeave class
- 02 Capacity providers billing by the GPU‑hour
- 03 Labs running 10k+ GPU pretraining
- 04 Anyone whose run is too big to babysit by hand
We start by keeping runs alive. We end as the layer the buildout stands on.
Mithril was the strongest and lightest material in Middle‑earth: a mesh worn unseen beneath everything else, one that simply does not break. As clusters grow from thousands of GPUs into hundreds of thousands, reliability stops being a feature and becomes infrastructure. We are building the mesh beneath the world's largest training runs: invisible, load‑bearing, and unbreakable.
The clusters only get bigger. Build the layer that holds.
We are selecting a small set of neocloud partners to measure failure waste on a real fleet.