Operational resilience is usually sold as a process: runbooks, on-call rotations, recovery time objectives. All of that is real, and all of it is heroics, humans compensating for an architecture that fails in ways it did not have to.
Why stateful endpoints fail badly
A persistent OS can be corrupted. A kernel-level agent can ship a bad update to every machine at once. When that happens, recovery means touching each device, and at fleet scale, that is days of downtime and a very long night.
There is no persistent OS to corrupt and no kernel agent to push a bad update into. The device reboots to known-good and keeps running.
Resilience as a property, not a procedure
On a stateless substrate, the worst day of the year becomes a non-event. No fire drill, no truck rolls, no all-hands at 3 a.m. The device does the only thing it knows how to do: boot clean and run authorized work.
That is resilience by architecture. It does not depend on anyone being awake.

