Curse you, RAID!

Brian Korver and I spent most of last night recovering severe filesystem corruption issues on a pair of our servers. For background for the rest of this post, here's the configuration of the machines:
  • Pentium 4 3 GHz
  • ASUS P4C800-E motherboards
  • 2x240 GB Maxtor SATA hard drives
  • FreeBSD 5.3/5.4

These ASUS motherboards have four SATA connectors, two tied to a Promise PDC20378 RAID controller and two tied to an Intel ICH5R. We had the drives set up in a RAID 1 (mirroring) configuration using the Promise controller, on the theory that this would provide high availability in the case of drive failure. Now, I'm the first to admit that my knowledge of RAID is sketchy, and as I was about to discover, the gap between theory and practice is rather large in this case. Call this critical mistake #1.

The precipitating event for last evening's festivities was a hardware fault in one of the drives on machine #1 during a routine operating system upgrade. This caused the machine to hang. Upon reboot I discovered, to my extreme dismay, that fsck was also hanging while trying to fix the hard drives. So much for high availability. Time to haul out the fixit CD, reboot, etc. Now, here's the thing you need to know about these machines: they're physically identical and installed in the rack one on top of the other (critical mistake #2), so in the process of trying to salvage machine 1, I accidentally frobbed the reset switch of machine 2 (critical mistake #3), and apparently on these machines ACPI isn't set up to treat this as a clean powerdown (critical mistake #4).

Now, ordinarily this just entails a reboot and some machine slowage while the background fsck runs, but in this case we got the extra bonus of the fsck hanging on startup. Remember that unlike machine 1, there's absolutely nothing wrong with machine 2's hardware. It was pure pilot error and fsck should be able to recover, but what we're actually seeing in practice is that it halts (by which I mean complete machine unresponsiveness) at even the first hint of filesystem corruption. It's not just fsck, either--trying to make a tar copy of the disk also causes hangage. Needless to say, this is not usual behavior for FreeBSD, which normally fscks just fine. The only other time I've seen this was on one of these machines about 6 months ago, so at this point I'm starting to suspect that the problem is the RAID hardware or the FreeBSD RAID drivers and that I'm basically hosed.

Needless to say, I was a little despondent about the meltdown, but in the midst of my preparations for hara-kiri, it came to me [*]: these drives are set up in a mirrored configuration, so each drive has a complete set of filesystems on it. The kernel can talk to each drive individually, so what if we just try to mount one of them directly rather than through the RAID? We tried it and it superficially worked, leading to the following restore procedure:

  1. Disconnect both drives from the Promise controller.
  2. Connect one drive up to the ICH5R, leaving--and this is important--the other unconnected as a backup in case we hose something during the restore.
  3. Turn off RAID on the ICH5R (it has RAID 0 support).
  4. Boot up the kernel in single-user mode. At this point it hangs because it's trying to find the root partition on the RAID /dev/ar0s1a, which no longer exists.
  5. Key the right partition (/dev/ad10s1a) in at the provided prompt.
  6. Once the machine comes up single-user carefully fsck all the filesystems.
  7. Edit /etc/fstab so that all references to /dev/ar0s1? point to /dev/ad10s1?.
  8. Reboot the machine.
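For the curious, step 7 boils down to a one-line device-name rewrite. Here's a rough sketch; the sample fstab lines below are illustrative stand-ins for the real file, and in practice you'd want to edit a copy (or back up /etc/fstab) before committing the change:

```shell
# Two fake fstab entries standing in for the real /etc/fstab,
# still pointing at the now-defunct RAID device ar0.
fstab='/dev/ar0s1a / ufs rw 1 1
/dev/ar0s1e /usr ufs rw 2 2'

# Rewrite every RAID device reference to the bare drive (ad10),
# which is where the kernel now sees the disk on the ICH5R.
echo "$fstab" | sed 's|/dev/ar0s1|/dev/ad10s1|g'
```

This prints the entries with /dev/ad10s1a and /dev/ad10s1e substituted in; against the real file you'd redirect the result back into /etc/fstab once you're satisfied it's right.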

This all worked fine for machine 2 (it was the more important of the two, so we fixed it first), but when we got to machine 1, we got a DMA error off drive A (remember the hardware fault that kicked all this off?). No problem, we'll use drive B. But when we tried it, the system couldn't see drive B at all. A little investigation revealed that in all the screwing around we'd managed to knock the SATA connector off of drive B (stupid interference-fit SATA connectors). Plugging it back in seemed appropriate, and let us see the drive.

At that point, we had working systems (without RAID and they're staying that way) and so were able to finish up:

  1. Take a fresh backup and put it in a safe place.
  2. Finish the system upgrade that we were doing when things went to hell.
Total elapsed time from first failure to going home for the evening: 7 hours.


One way to provide extra protection here is to take a second RAID set, or just a single drive, and replicate to it periodically. Then if the main drives get hosed, at least you have a near-real-time backup.
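That replication can be as simple as a tar pipe (the same kind of tar copy mentioned above) run from cron. The sketch below demonstrates the idea with temporary directories so it's runnable anywhere; the real paths, the spare-drive mount point, and the cron schedule are all assumptions, not details from this setup:

```shell
# Stand-ins for the live filesystem and the spare drive's mount point.
src=$(mktemp -d)
dst=$(mktemp -d)

# Some "data" to replicate.
echo "important data" > "$src/file.txt"

# Replicate src onto dst with a tar pipe, preserving the tree.
# On the real systems this would run periodically from cron,
# with dst being the spare drive.
(cd "$src" && tar cf - .) | (cd "$dst" && tar xf -)

# The copy now exists on the "spare drive".
cat "$dst/file.txt"
```

A nightly run of something like this won't save you from corruption that replicates before you notice it, but it does give you a second, independently mountable copy of the data, which is exactly what saved us here.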
