Disk Survey

Surveying the disks of the world

Disk Error Recovery

When a disk hits a problem spot it tries to recover in various ways. The most obvious is to simply wait another rotation and retry the read. Another option is to seek away and back to the location and see whether the head lands on the right spot this time. The specifics of a disk's recovery procedure are not openly described, and knowing them is not very useful anyway. A few points are, however, very interesting to any study of disk failures.

The first and most obvious point is that this takes time, possibly a lot of time. If the disk has a small problem and the first recovery option succeeds, the delay will hardly be noticeable. If, however, only the last option manages to recover the data, the recovery will already have taken multiple seconds; the exact number depends on the disk. The Seagate specification for the Savvio 15K.3 enterprise disk (SAS) reports in section 10.2 that the disk has up to 20 recovery steps on read and 6 steps on write, with a maximum time of 1.5 seconds per LBA. The per-LBA point is important: if there is a problem in more than one LBA the times add up, so an 8K request (16 LBAs of 512 bytes) where all LBAs require full recovery will take 24 seconds for this one request.
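The arithmetic above can be written out as a small sketch. The function name and the 512-byte LBA size are assumptions made for illustration; the 1.5 second figure is the per-LBA maximum quoted from the Savvio 15K.3 spec, and other disks will differ.

```python
import math

def worst_case_recovery_seconds(request_bytes, lba_bytes=512, max_seconds_per_lba=1.5):
    """Upper bound on recovery time when every LBA in the request needs full recovery.

    lba_bytes=512 is an assumed sector size; max_seconds_per_lba=1.5 is the
    figure quoted for the Savvio 15K.3 and is not universal.
    """
    lbas = math.ceil(request_bytes / lba_bytes)  # number of LBAs the request touches
    return lbas * max_seconds_per_lba

# An 8K request spans 16 LBAs of 512 bytes: 16 * 1.5 s = 24 s.
print(worst_case_recovery_seconds(8 * 1024))  # 24.0
```

This is a worst-case bound only; a real request would usually hit full recovery on far fewer of its LBAs.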

SAS disks have specific options to control the maximum time of error recovery: Read Retry Count and Write Retry Count limit the number of steps performed, and the Recovery Time Limit in the Error Recovery mode page limits the total time of error recovery for a request. SATA disks provide similar features in some models, such as Seagate Error Recovery Control (ERC), Western Digital Time Limited Error Recovery (TLER) and Samsung/Hitachi Command Completion Time Limit (CCTL). It is important to remember that limiting the recovery time of the disk improves overall system performance but increases the chance of getting a Media Error response, which essentially means that the data in that sector is lost.
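On SATA disks that support it, one way to set such a limit is the `scterc` log option of `smartctl` from smartmontools. That tool is not mentioned above and is an assumption here, as are the helper name and the 7 second value used in the example. The sketch below only builds the command line; it does not run it, since that needs root and a disk that implements the feature.

```python
def scterc_command(device, read_seconds, write_seconds):
    """Build a smartctl argv that sets the SCT ERC read and write time limits.

    smartctl expects both limits in units of 100 ms, so seconds are
    converted to deciseconds. The helper name is hypothetical.
    """
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]

# Illustrative values only: limit both read and write recovery to 7 seconds.
print(" ".join(scterc_command("/dev/sda", 7, 7)))
# smartctl -l scterc,70,70 /dev/sda
```

The returned list can be passed to `subprocess.run` as-is. Whether the setting survives a power cycle depends on the disk, so it may need to be reapplied at boot.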

Both the SAS and the SATA mechanisms are mostly useful for RAID setups, where the data can be recovered from another component of the RAID group. For a desktop disk you’d do better by letting the disk do its full recovery and, if errors recur too often, assuming the disk is dying and replacing it.

For SAS disks it is possible to get some sense of what happens if the disk is configured to report recovered errors and the sense reports are parsed for the sense-key specific field (marked valid by the SKSV bit). This field includes the number of error recovery stages attempted, and it may be useful to track it for informational purposes. It may be of limited value, though, since disks may also use a non-linear error recovery procedure, meaning that a step may be skipped if the disk deems it unlikely to be useful. Such a dynamic error recovery procedure will also reduce the time to recovery.
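A minimal sketch of that parsing, assuming fixed-format sense data (response codes 70h/71h), where byte 15 carries the SKSV bit and bytes 16-17 carry the actual retry count for the RECOVERED ERROR, MEDIUM ERROR and HARDWARE ERROR sense keys. The function name is hypothetical, and descriptor-format sense data (72h/73h) is not handled.

```python
# Sense keys whose sense-key specific field holds an actual retry count.
RETRY_COUNT_KEYS = {0x1: "RECOVERED ERROR", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR"}

def actual_retry_count(sense):
    """Return (sense_key_name, retry_count), or None if the field is not available."""
    if len(sense) < 18:
        return None                      # too short to contain bytes 15-17
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        return None                      # not fixed format; descriptor format is skipped
    key = sense[2] & 0x0F                # sense key lives in the low nibble of byte 2
    if key not in RETRY_COUNT_KEYS:
        return None
    if not (sense[15] & 0x80):
        return None                      # SKSV clear: sense-key specific bytes are not valid
    return RETRY_COUNT_KEYS[key], int.from_bytes(sense[16:18], "big")

# Hand-built example: recovered error, SKSV set, three retries reported.
sense = bytearray(18)
sense[0] = 0x70   # fixed format, current error
sense[2] = 0x01   # RECOVERED ERROR
sense[15] = 0x80  # SKSV bit
sense[17] = 0x03  # actual retry count, low byte
print(actual_retry_count(bytes(sense)))  # ('RECOVERED ERROR', 3)
```

Because of the non-linear recovery procedures described above, the count is best treated as a rough indicator rather than an exact step number.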

SSDs have a similar issue: they also retry in various forms, and the same problem exists for them, except that the time scale is much smaller and the full error recovery is unlikely to take more than 10 seconds per block. Enterprise-level SSDs also feature an internal RAID structure that lets them recover the data even if the block itself is dead. Using it causes an even greater disturbance, since it eliminates any parallelism that is possible inside the SSD.