09:59 pm: We never failed to fail … it was the easiest thing to do …

I spent my day waiting for a system crash.

We’ve got this serious, hardcore system lockup that happens sometimes with the customer of the moment. Naturally, it only happens when the system is flat-out running their most important computation. Naturally, it can reasonably be said that if this code doesn't work, then the system is useless to us.

Basically, if I get all the systems rocking on a parallel task, then sometimes after a few hours, one of them will crash hard enough that it doesn’t respond to ping. Naturally this takes the whole parallel job with it. Ctrl-alt-delete (through the KVM) doesn’t cut it. Needs a finger-on-the-button hard reboot. Of course, I have remote control over the power outlets for these systems … so I don’t have to fly to Maryland every time this happens … but still.

So, I started down the list: Power, cooling, bad memory, flaky filesystems, …

At the same time I was working down another list: Too much memory use? Too many tasks, colliding on some secret lock-file? Oversubscribing the NFS server?

And yet a third: Bad input? Crappy data files?

Finally, I would get the system rocking and go on to other tasks … until after a while I would exclaim:


Then I would try again.

Thought I got it. I really did. Left. Dropped the laptop at the hotel. Went to yoga. Good, relaxing, and the knee even handled it well.

When I got back to the hotel, I looked at the computer and said:


Debug debug debug …

Originally published at chris.dwan.org. You can comment here or there.
