Thursday, May 10, 2012

Cray: I broke all the things

Not actually all the things, but I did hose one system.

The OS update this morning ran out of disk space. Not to worry: I had just been trained in how to safely delete old update files that are no longer needed. I ran the special program, and it worked. I ran it again, this time giving it several sets of files to prune, but it took a long time to run. There was already enough free space by then, so I stopped it halfway through (I was told this was safe).

Somehow, whether because I stopped the program or for some other reason, the system got corrupted and had to be restored from backup. It's still not working.

Here's what I learned:

- It's important to log all your activity (which I did), so the debugging folk have a better chance of finding what went wrong.
- It's important to do frequent backups of your dev systems (the last backup of this system was over a month old).
- My Cray coworkers are super smart when it comes to problem-solving. I described the problem, and this sparked a very lively (and intelligent) conversation about how to solve it. I'm used to having to solve most things on my own, so this is a nice change.
- At the end of the day, whether it was my fault or a bug in the program I ran, these things happen. People aren't perfect, code's not perfect, and that's why we have jobs.
