Disk Survey

Surveying the disks of the world

HDD Inside Out


An important first step in understanding disk failures is to understand how a disk works. Very well made descriptions already exist, so I will only point the reader to them:

 

Machine Learning That Matters


I’ve seen several attempts to use machine learning to analyze disk failures. Most of them focused on things that don’t matter and have little effect on the outside world. A nice paper that discusses this problem in the context of the machine learning community is “Machine Learning that Matters”.

A Definition of Disk Failure


What is a disk failure?

Most, if not all, of the disk failure studies that I’m aware of do not define, or even attempt to explain, what was considered a disk failure. The simple act of replacing a disk signifies that it is considered “failed” in the study. Unfortunately, without a completely controlled study in which each disk is accounted for, with a log of the exact reasons for its replacement and the diagnostics that were attempted before it was replaced, there is little chance of doing better than that.

In the end, a disk failure is a completely subjective matter, as is also evidenced by the relatively high rates of NDF (No Defect Found) reports by the disk manufacturers. There is a constant struggle between disk users and disk vendors over the definition of a failed disk and the proof needed to declare one. Some failures are soft failures, where the disk doesn’t satisfy the system design criteria: it may respond too slowly or have a high variation in response times. Often, letting the disk rest while powered on would allow it to recover by its own mechanisms, but that may not be a viable option when the disk or the slot is needed for actual work.

In the study I am trying to develop I will have to resort to user reports of drive failures, as there is no strict control of the disks. I hope to collect from users the reason a disk was considered failed, to help improve the analysis later on.

 

Introduction


This blog is about disk failures. I’m trying to describe what I know about disks and their failures, and to create a disk survey project that collects data in an open way, with an eye toward studying disk failure topics and enabling others to study them. Unfortunately there is little public information on the life cycle of disks, and what does exist is scattered around and fairly hard to find, let alone make sense of.

With this project I hope to collect the body of knowledge, contribute to it from my experience and, even more importantly, provide an open platform to collect data about disks and enable everyone to come up with their own hypotheses and attempt to validate them.

Disk Error Recovery


When a disk hits a problem spot it will try to recover from it in various ways. The most obvious way is simply to wait another rotation and retry the read. Another option is to seek away and back to the location and see if the head lands on the right spot this time. The specifics of disk recovery are not openly described, and knowing them is not very useful anyway. A few points are, however, very interesting to any study of disk failures.

The first and most obvious point is that this takes time, possibly a lot of time. If the disk has a small problem and the first recovery option fixes it, the delay will hardly be noticeable. If, however, only the last option manages to recover the data, multiple seconds have already passed; the exact number depends on the disk. The Seagate spec for the Savvio 15K.3 enterprise disk (SAS) reports in section 10.2 that the disk has up to 20 recovery steps on read and 6 steps on write, with a maximum time of 1.5 seconds per LBA. The per-LBA point is important: if more than one LBA has a problem the times add up, so an 8K request (16 LBAs of 512 bytes) in which every LBA requires full recovery will take 24 seconds for this one particular request.
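To make the per-LBA arithmetic concrete, here is a minimal sketch. The function name and its parameters are my own, and the 512-byte sector size is an assumption (it is the size implied by the 24 second figure for an 8K request).

```python
import math

def worst_case_recovery_seconds(request_bytes, sector_bytes=512,
                                max_seconds_per_lba=1.5):
    """Upper bound on recovery time if every LBA needs full recovery.

    sector_bytes=512 is an assumption; 1.5 s per LBA is the maximum
    quoted above for the Savvio 15K.3.
    """
    lbas = math.ceil(request_bytes / sector_bytes)
    return lbas * max_seconds_per_lba

# An 8K request spans 16 LBAs of 512 bytes: 16 * 1.5 s = 24.0 s
print(worst_case_recovery_seconds(8 * 1024))
```

The same request on a disk with 4096-byte sectors would span only 2 LBAs, so the bound depends heavily on the sector size.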

SAS disks have specific options to control the maximum time of error recovery: Read Retry Count and Write Retry Count limit the number of steps performed, and the Recovery Time Limit in the Error Recovery mode page limits the total time of error recovery for a request. SATA disks provide similar features in some models, such as the Seagate Error Recovery Control (ERC), the Western Digital Time Limited Error Recovery (TLER) and the Samsung/Hitachi Command Completion Time Limit (CCTL). It is important to remember that limiting the recovery time of the disk improves overall system performance but increases the chance of getting a Media Error response, which essentially means that the data in that sector is lost.
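On SATA disks that support it, one common way to set such a limit is the `scterc` log option of smartctl, which takes the read and write limits in tenths of a second. The sketch below only builds the command line and does not run it. The helper name, the `/dev/sdX` placeholder and the 7 second value are illustrative assumptions, and whether a given disk accepts or keeps the setting depends on the model.

```python
def scterc_command(device, read_limit_s, write_limit_s):
    """Build (but do not run) a smartctl command that sets the
    SCT Error Recovery Control read/write time limits.

    smartctl expects the limits in tenths of a second.
    """
    read_ds = round(read_limit_s * 10)
    write_ds = round(write_limit_s * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]

# Illustrative values only: 7 s for both reads and writes.
print(" ".join(scterc_command("/dev/sdX", 7.0, 7.0)))
```

Building the argument list separately from running it keeps the unit conversion easy to check before anything touches a real disk.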

Both the SAS and the SATA mechanisms are mostly useful for RAID setups, where the data can be recovered from another component of the RAID group. For a desktop disk you’d do better to let the disk do its full recovery and, if this happens too often, assume the disk is dying and replace it.

For SAS disks it is possible to get some sense of what happens if the disk is configured to report recovered errors and the sense reports are parsed for the SKSV part. This includes the number of error recovery stages attempted for the recovery, and it may be useful to track it for informational purposes. It may be of limited value, though, since disks may also use a non-linear error recovery procedure, meaning that a step may be skipped if the disk deems it unlikely to be useful. Such a dynamic error recovery procedure will also reduce the time to recovery.
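As a rough illustration of what parsing the SKSV part can look like, the sketch below reads the retry count from fixed-format SCSI sense data (response code 0x70 or 0x71). The function name is my own, the sample buffer is synthetic rather than captured from a real disk, and how a disk maps its internal recovery stages onto this count is vendor specific.

```python
def actual_retry_count(sense: bytes):
    """Return the Actual Retry Count from fixed-format sense data,
    or None if the field is absent or not valid."""
    if len(sense) < 18:
        return None
    if sense[0] & 0x7F not in (0x70, 0x71):   # fixed format only
        return None
    sense_key = sense[2] & 0x0F
    # The retry count applies to RECOVERED ERROR (1h),
    # MEDIUM ERROR (3h) and HARDWARE ERROR (4h).
    if sense_key not in (0x1, 0x3, 0x4):
        return None
    if not sense[15] & 0x80:                  # SKSV bit must be set
        return None
    return int.from_bytes(sense[16:18], "big")

# Synthetic example: RECOVERED ERROR with SKSV set and a count of 5.
sample = bytearray(18)
sample[0] = 0x70      # current error, fixed format
sample[2] = 0x01      # sense key: RECOVERED ERROR
sample[15] = 0x80     # SKSV
sample[16:18] = (5).to_bytes(2, "big")
print(actual_retry_count(bytes(sample)))
```

Descriptor-format sense data (response codes 0x72 and 0x73) carries the same information in a different layout and is not handled here.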

SSDs have a similar issue. They also retry in various forms, so the same problem exists for them, except that the time scale is much smaller and full error recovery is unlikely to take more than 10 seconds per block. Enterprise-level SSDs also feature an internal RAID structure that lets them recover the data even if the block itself is dead. Such a recovery causes an even greater disturbance, since it eliminates any parallelism that is possible inside the SSD.