Excuse Me, Your SCSI Is Slipping

So you’ve got a hot-swap SCSI RAID array and you figure you are covered, even if a disk fails. Well, that is usually true. Kinda. Here is a long story with a few short conclusions about what to do when SCSI drives start failing.

Environment

Five-year-old Dell PowerEdge 1500SC with a PERC 3/SC RAID controller. Four 18GB SCSI disks configured as a RAID 5 array, with a fifth 18GB disk available as a hot spare. The sixth disk is a 36GB drive configured as RAID 0 (not redundant). Windows Server 2003 R2.
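As a quick sanity check on what that layout provides, here is a minimal sketch of the usual RAID 5 capacity arithmetic (one disk's worth of space goes to parity). The function name is made up for illustration, and the result is nominal capacity, not what Windows will report after formatting.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Nominal usable space of a RAID 5 set: one disk's worth goes to parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


# Four 18GB disks in the array; the hot spare adds no capacity until it is used.
print(raid5_usable_gb(4, 18))
```

The hot spare and the 36GB RAID 0 volume are deliberately left out, since neither contributes to the redundant array's capacity.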

In this article, I’ll call the physical disks Disk 1 through Disk 6, corresponding to what the SCSI BIOS calls 0:0 through 0:5.

The Drama

First of all, kudos to Dell OpenManage with an IT Assistant workstation. What a pain to set up, but without them, I wouldn’t have known about these disk issues until who knows when. The client’s server sent SNMP warnings to my IT Assistant workstation (through a router-based VPN). Then IT Assistant sent me emails reporting the critical errors.

Disk 6 Failing

Disk 6, containing the 36GB non-redundant volume, started generating SCSI sense errors, which IT Assistant faithfully passed on to me. Example: “Alert message ID: 2095, Array disk warning. SCSI sense data Sense key: 3 Sense code:11 Sense qualifier: 1, Controller 0, Channel 0, Array Disk 0:5” (that’s Disk 6). After discussing it with Dell support, I decided this drive was on its last legs. I moved the data from this disk to an external drive and stopped using Disk 6. However, I did not delete the virtual disk, nor did I remove the physical disk. I figured that as long as I didn’t need the data, it didn’t hurt to leave the failing drive in there for a while.
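If you get a lot of these emails, it can help to pull the interesting fields out automatically. The following is only a sketch: the regular expression is guessed from the single alert quoted above, the function name is made up, and it assumes the sense values are printed in hexadecimal. The sense key names come from the SCSI standard, where key 3 is MEDIUM ERROR. The disk number uses this article's convention of target ID plus one.

```python
import re

# Sense key names as defined by the SCSI Primary Commands standard (partial list).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}

# Hypothetical pattern, modeled on the one alert shown in this article.
ALERT_RE = re.compile(
    r"Sense key:\s*(?P<key>[0-9A-Fa-f]+)\s+"
    r"Sense code:\s*(?P<code>[0-9A-Fa-f]+)\s+"
    r"Sense qualifier:\s*(?P<qual>[0-9A-Fa-f]+).*?"
    r"Array Disk (?P<channel>\d+):(?P<target>\d+)"
)


def decode_alert(message):
    """Return the sense fields and disk number from an alert, or None if no match."""
    m = ALERT_RE.search(message)
    if m is None:
        return None
    key = int(m.group("key"), 16)  # assumption: values are hexadecimal
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "sense_code": int(m.group("code"), 16),
        "sense_qualifier": int(m.group("qual"), 16),
        "disk_number": int(m.group("target")) + 1,  # 0:5 is Disk 6 in this article
    }


alert = ("Alert message ID: 2095, Array disk warning. SCSI sense data "
         "Sense key: 3 Sense code:11 Sense qualifier: 1, Controller 0, "
         "Channel 0, Array Disk 0:5")
print(decode_alert(alert))
```

A medium error means the drive itself is reporting trouble reading its platters, which fits the "last legs" verdict.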

Disk 4 Failed

Suddenly, a timeout error appeared in the event log:

Source: mraid35x
Event ID: 9
Description: The device, \Device\Scsi\mraid35x1, did not respond within the timeout period.

IT Assistant did not forward this message, but it did tell me that the virtual disk was “degraded,” which means that a disk had dropped out of the array. When I checked the server, I saw that Disk 4 had been dropped from the array and that the hot spare (Disk 5) had automatically become active and was being rebuilt into the RAID array. So the automatic rebuild using the hot spare worked great, but why did Disk 4 fail in the first place? No hardware errors were reported on Disk 4 before or after it was kicked out of the RAID 5 array.

Using Dell Diagnostics under Windows, a Quick Test of Disk 4 reported that a Verify did in fact fail. (As an aside, the diagnostics caused the Terminal Server and web server to stop responding for some time, perhaps because they were trying to gain access to the tape drive that was under control of the Veritas Backup Exec drivers. Fortunately the machine eventually “un-hung” itself, averting an emergency road trip to the client site.)

On Site: A Lovely BSOD and a Rare RAID BIOS Issue

The next day I went to the client site to assess the situation.

The first thing I wanted to do was get rid of Disk 6 (the one with SCSI sense errors). Under Windows, using the Dell Server Administrator (the local OpenManage application), I deleted the second virtual disk, then made Disk 6 ready for removal. Its lights flashed appropriately, then turned off. (Flashing lights are a good thing, reminding me that the drives are numbered from right to left.) I removed the left-most Disk 6, then restarted Windows. Windows took a long time shutting down, eventually displaying a blue screen with this message:

KERNEL_STACK_INPAGE_ERROR
Stop 0x00000077 (0xC0000185, 0xC0000185, 0x00000000, 0x00C1D000)

After generating a full memory dump, the system restarted. But the RAID BIOS got stuck on the message

Spinning SCSI devices…80%

Google has only four hits on the “Spinning SCSI devices” error. The one in English seems to be about Linux. However, a search of the Microsoft Knowledge Base for the kernel error was more fruitful. A quick review of this article:

Troubleshooting “Stop 0x00000077” or “KERNEL_STACK_INPAGE_ERROR”
http://support.microsoft.com/kb/315266

led to the conclusion that Windows had encountered a SCSI termination problem (“0xC0000185, or STATUS_IO_DEVICE_ERROR: improper termination or defective cabling of SCSI-based devices, or two devices attempting to use the same IRQ.”) That makes some sense, since this was the last device in the SCSI chain, but isn’t the RAID controller supposed to allow removing a drive and dynamically adjust the termination? Well, maybe not.
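For next time, here is a small sketch that does the lookup for you. The table holds NTSTATUS codes that commonly accompany this stop error. The function name is made up for illustration, and it assumes the second value in the parentheses is the I/O status code (here it is the same as the first). The example line writes the status the way the KB article does, as 0xC0000185.

```python
import re

# NTSTATUS values that commonly appear with Stop 0x77.
NTSTATUS_NAMES = {
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR",
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED",
    0xC0000185: "STATUS_IO_DEVICE_ERROR",
}


def decode_stop_77(line):
    """Name the I/O status in a Stop 0x77 line; None if it is not a Stop 0x77."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9A-Fa-f]+", line)]
    # values[0] is the stop code itself; values[1:] are the four parameters.
    if len(values) < 3 or values[0] != 0x77:
        return None
    status = values[2]  # assumption: second parameter is the I/O status code
    return NTSTATUS_NAMES.get(status, "unknown status 0x%08X" % status)


print(decode_stop_77(
    "Stop 0x00000077 (0xC0000185, 0xC0000185, 0x00000000, 0x00C1D000)"))
```

Anything the table does not know comes back as "unknown status" with the raw value, so a mistyped code is easy to spot.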

I inserted the failed Disk 6 back into its slot and turned on the machine. No more “Spinning SCSI Devices” message. Windows booted fine and shut down fine.

This time I booted into the RAID BIOS and double-checked that Disk 6 was not in use (in other words, it was in a Ready state). I couldn’t find an option to make the physical drive “ready for removal,” so I just powered off the server, removed Disk 6, and powered on again. Still no “Spinning SCSI Devices” message. Conclusion: the RAID controller needs to cycle power to change SCSI termination.

Re-Testing Disk 4

Alright, now what about that Disk 4 that supposedly failed? It’s a Quantum Atlas 10K III. I wanted to run the vendor’s SCSI Max tests against it (downloaded from Seagate’s site), but that only runs from a Windows 98 boot diskette, and that diskette hung when searching the PCI bus of the server. I had doubts that it would be compatible with the Perc RAID controller anyway. My backup plan was to take Disk 4 back to the office and insert it into an old PowerEdge 2400 that has a SCSI backplane but no RAID controller. My hunch was that this would allow accessing the drive as a “normal” SCSI device.

But while I was on site, I decided to try the Dell diagnostics. During system boot, I pressed F10 to start the Dell system utilities from the utility partition. I ran the Quick Test on that drive: Pass! Then the Extended Test (45 minutes): Pass! So hopefully Disk 4 is actually still good and was just showing errors because the failing Disk 6 was still in the chain.

At this point I went back into the RAID BIOS to assign Disk 4 as a global hot spare. Oh, and while the server was down, I made a photocopy of Disk 4 so I have the serial number for verifying warranty status, should the need arise. Finally, after booting into Windows, I ran the Quick Test from the Windows version of Dell diagnostics. This test, which had failed the day before, now passed as well.

Lessons Learned

There are a couple of things I will keep in mind for next time.

Work or Get Off the Bus

If a drive is generating SCSI errors, fix the errors or get the drive off the SCSI bus. Don’t leave the drive there, even if you aren’t using it, or it might cause problems with other drives.

Hot Swap Does Not Mean Hot Remove

Maybe you can replace a drive without shutting down the server. But to remove a drive, especially the last drive in the SCSI chain, remove the drive from the RAID configuration (whether under Windows or the RAID BIOS), then power down the server. Remove the drive and power up. This way, the RAID controller stands a better chance of adjusting termination.

Remove First from Windows, Then from RAID

Although this didn’t cause problems this time, it would probably be best to delete the volume under Windows Disk Management before removing its logical volume from the RAID configuration.