Drive Drops Off the RAID

A common complaint about drive failures in RAIDs is that a drive dropped off the RAID for no apparent reason. The drive is then tested, seems functional, gets added back to the RAID, and works fine with no immediate issue. This may repeat with the same drive every once in a while, which is quite a bother: the array needs more maintenance and keeps going into degraded mode for what seems to be no real reason.

While it is impossible to discount a bug in the RAID software or firmware, it is more than likely that the fault lies with the disk itself. The disk is probably having problems with some locations on its media, and when it performs error recovery on them the retries can take a long time. The RAID then decides to drop the disk from the array for being unresponsive. The reason for this decision is that the RAID has received IOs and passed them to the disk, and the disk hasn't responded yet. Until it does, the RAID cannot respond to the external IO, and so it risks causing problems upstream. The simplest thing the RAID can do is to drop the disk, go into degraded mode, and recover the lost data from the RAID redundancy.
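The drop decision described above can be sketched as a simple timeout check. This is a minimal illustration, not how any particular RAID implementation works: the `Member` and `Array` classes and the 30 second `io_timeout_s` value are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    in_array: bool = True


@dataclass
class Array:
    members: list
    io_timeout_s: float = 30.0  # assumed value, not taken from any real RAID

    def complete_io(self, member: Member, elapsed_s: float) -> str:
        """Decide what to do with an IO that took elapsed_s on one member."""
        if elapsed_s <= self.io_timeout_s:
            return "ok"
        # The disk is presumably still retrying internally. Stop waiting,
        # drop it, and let the caller serve the IO from redundancy.
        member.in_array = False
        return "dropped"

    @property
    def degraded(self) -> bool:
        return any(not m.in_array for m in self.members)


disks = [Member("disk0"), Member("disk1")]
array = Array(disks)
first = array.complete_io(disks[0], 2.0)    # answered in time
second = array.complete_io(disks[1], 45.0)  # stuck in error recovery
```

Note that nothing in this check looks at whether the disk is actually broken. It only looks at how long the disk took, which is why a drive that later tests as healthy can still get dropped.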

The result is that the disk gains some quiet time in which it can run whatever background process it has to recover the bad media location. When it is tested later on it will seem to be working very well, and there will be no apparent reason not to add it back to the RAID.

The risk of adding the disk back into the RAID is a double failure: if this semi-problematic disk fails again shortly before another device in the same RAID fails, you risk data loss or a severe loss of service. You also increase the risk of hitting another media failure while reading the data back from the RAID redundancy, since the RAID is in degraded mode due to this disk's failure.
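To get a feel for the second risk, here is a rough back-of-the-envelope sketch of the chance of hitting at least one unreadable location while reading from the remaining disks. It assumes that read errors are independent and occur at a fixed rate per bit, which is a simplification. The rate and the amount of data below are illustrative numbers, not measurements from any drive.

```python
import math


def p_read_error(bytes_read: float, errors_per_bit: float) -> float:
    """Probability of at least one read error, assuming independent bits."""
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-errors_per_bit))


# Assumed figures: 1 TB read in degraded mode, one error per 1e14 bits.
risk = p_read_error(1e12, 1e-14)  # about 0.077
```

The point of the sketch is only that the risk grows with the amount of data read in degraded mode, so every extra trip into degraded mode caused by the flaky disk adds exposure.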

In most cases a disk that begins to exhibit such problems is on a declining trend and is likely to fail sooner rather than later. It is best to remove it from the RAID and, if it is reused at all, to use it only in an unimportant role.

What a stronger RAID could do instead of just dropping the disk is to keep it partially in the RAID while still letting it rest: perhaps deliver only writes to it, or mark the area around the problem location as bad blocks. That requires a good deal of record keeping and guesswork, which makes it harder to get working reliably, but it is doable.
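The record keeping for the bad-blocks idea could look something like the sketch below. It is only one possible shape for it: the `SuspectRegions` class, the `route_read` helper, and the margin of 1024 blocks are all hypothetical, and the margin in particular is exactly the kind of guess mentioned above.

```python
class SuspectRegions:
    """Tracks block ranges fenced off around slow or failed IOs."""

    def __init__(self, margin: int = 1024):
        self.margin = margin  # assumed padding in blocks, purely a guess
        self.ranges = []      # sorted, non-overlapping (start, end), end exclusive

    def mark(self, lba: int) -> None:
        """Fence off the area around lba, merging with nearby ranges."""
        start, end = max(0, lba - self.margin), lba + self.margin + 1
        kept = []
        for s, e in self.ranges:
            if e < start or s > end:
                kept.append((s, e))  # unrelated range, keep as is
            else:
                start, end = min(start, s), max(end, e)  # absorb it
        kept.append((start, end))
        self.ranges = sorted(kept)

    def is_suspect(self, lba: int) -> bool:
        return any(s <= lba < e for s, e in self.ranges)


def route_read(regions: SuspectRegions, lba: int) -> str:
    """Serve reads in a suspect area from redundancy, the rest from the disk."""
    return "redundancy" if regions.is_suspect(lba) else "disk"


regions = SuspectRegions(margin=10)
regions.mark(100)   # fences blocks 90..110
regions.mark(115)   # overlaps, so the two ranges merge into 90..125
regions.mark(1000)  # far away, so it stays a separate range
```

With this in place the disk keeps serving the areas that are healthy, while reads that land near a known trouble spot never reach it and so cannot stall the array.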

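One thing that disk drive makers do to help their disks survive in a RAID is to limit the time the drive will spend attempting error recovery. Each maker uses a different name and acronym for this: TLER (Time-Limited Error Recovery), ERC (Error Recovery Control), and CCTL (Command Completion Time Limit).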
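The recovery limit only helps if it is shorter than the time the RAID is willing to wait, and that relationship is easy to check. The sketch below assumes the limit is expressed in tenths of a second, as the `scterc` option of smartmontools' `smartctl` takes it, with 0 meaning the limit is disabled. The helper names are hypothetical, the 7 second and 30 second figures are illustrative, and not every drive supports changing the setting.

```python
def erc_fits_host_timeout(erc_deciseconds: int, host_timeout_s: float) -> bool:
    """True if the drive gives up on recovery before the host stops waiting."""
    if erc_deciseconds <= 0:
        # 0 means the limit is disabled, so recovery time is unbounded.
        return False
    return erc_deciseconds / 10.0 < host_timeout_s


def smartctl_set_erc_args(device: str, read_ds: int, write_ds: int) -> list:
    """Build (but do not run) a smartctl command requesting the given limits."""
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# Illustrative values: a 7.0 second limit against a 30 second host timeout.
ok = erc_fits_host_timeout(70, 30.0)
cmd = smartctl_set_erc_args("/dev/sdX", 70, 70)
```

The command is only built here, not executed, so the sketch is safe to run anywhere. `/dev/sdX` is a placeholder for the real device.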