What is a disk failure?
Most, if not all, of the disk failure studies that I’m aware of do not define, or even attempt to explain, what they counted as a disk failure. The simple act of replacing a disk is what marks it as “failed” in the study. Unfortunately, without a completely controlled study in which each disk is accounted for, with a log of the exact reasons for its replacement and the diagnostics that were attempted before it was replaced, there is little chance of doing better than that.
In the end, a disk failure is a completely subjective matter, as the relatively high rates of NDF (No Defect Found) reports from the disk manufacturers also show. Disk users and disk vendors constantly struggle over the definition of a failure and over the proof needed to declare a disk failed. Some failures are soft failures, where the disk doesn’t satisfy the system design criteria: it may respond too slowly or show a high variation in response times. Often, letting the disk rest while still powered on allows it to recover through its own mechanisms, but that may not be a viable option when the disk or the slot is needed for actual work.
In the study I am trying to develop, I will have to resort to user reports of drive failures, as there is no strict control over the disks. I hope to collect from the users the reason each disk was considered failed, to help improve the analysis later on.
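As a rough illustration of what such a user report could capture, here is a minimal Python sketch. The record fields, the reason categories, and the names `FailureReport`, `FailureReason` and `count_by_reason` are all hypothetical and only meant to show the idea of recording the reason alongside the replacement. They are not a settled design for the study.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureReason(Enum):
    # Hypothetical categories; a real study would refine these with its users.
    HARD_FAILURE = "hard failure"          # disk no longer responds at all
    SLOW_RESPONSE = "slow response"        # soft failure: responds too slowly
    ERRATIC_RESPONSE = "erratic response"  # soft failure: high variation in response times
    UNKNOWN = "unknown"                    # disk was replaced, no reason given


@dataclass
class FailureReport:
    # One user-submitted report about a replaced disk (assumed fields).
    disk_id: str
    reason: FailureReason = FailureReason.UNKNOWN
    notes: str = ""  # free text, e.g. which diagnostics were attempted


def count_by_reason(reports):
    # Tally how many reports fall under each stated reason.
    return Counter(report.reason for report in reports)
```

Even a coarse breakdown like this would make it possible to separate soft failures from disks that were replaced with no stated reason, instead of treating every replacement as the same kind of failure.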