I spent my day waiting for a system crash.
We’ve got this serious, hardcore system lockup that happens sometimes with the customer of the moment. Naturally, it only happens when the system is flat-out running their most important computation. Naturally, it can reasonably be said that if this code doesn't work, then the system is useless to us.
Basically, if I get all the systems rocking on a parallel task, then sometimes after a few hours, one of them will crash hard enough that it doesn’t respond to ping. Naturally this takes the whole parallel job with it. Ctrl-alt-delete (through the KVM) doesn’t cut it. Needs a finger-on-the-button hard reboot. Of course, I have remote control over the power outlets for these systems … so I don’t have to fly to Maryland every time this happens … but still.
So, I started down the list: Power, cooling, bad memory, flaky filesystems, …
At the same time I was working down another list: Too much memory use? Too many tasks, colliding on some secret lock-file? Oversubscribing the NFS server?
And yet a third: Bad input? Crappy data files?
Finally, I would get the system rocking and go on to other tasks … until after a while I would exclaim:
Then I would try again.
Thought I got it. I really did. Left. Dropped the laptop at the hotel. Went to yoga. Good, relaxing, and the knee even handled it well.
When I got back to the hotel, I looked at the computer and said:
Debug debug debug …