Disk Survey

Surveying the disks of the world

Disk Latency Is What Really Matters

The best indicator of a disk problem, whether brewing or already happening, is the latency of disk requests. Normally a disk shows fairly consistent, though noisy, latency figures that depend on many factors, but they tend to stay within a ballpark range that represents normal behavior. A disk with a small problem will show a higher latency once in a long while, which points to a random fault. If, however, the disk exhibits a repeating pattern of high-latency requests, it is definitely an indication of a problem with the disk. Often the problem starts at a very low rate and becomes more frequent until the disk breaks down completely.

Depending on the usage model, it may be a good idea to catch this early. Even if the decision is made to keep the disk in place until it actually dies, it is worth monitoring the overall system situation: if several disks in the same RAID group exhibit problems, you may find yourself facing data loss from a multiple-disk failure.
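To make the difference between a one-off slow request and a repeating pattern concrete, here is a minimal sketch in Python. The function name, the window of 1000 requests, and the repeat count of 3 are my own assumptions for illustration, and the 3 second threshold is simply the upper end of the normal range quoted in this post; tune all of them to your own disks and workload.

```python
from collections import deque


def classify_latencies(latencies_ms, threshold_ms=3000.0, window=1000, repeat_count=3):
    """Classify a sequence of per-request latencies (in milliseconds).

    Returns "ok" if no request exceeded the threshold, "isolated" if slow
    requests occurred but never clustered, and "repeating" if at least
    `repeat_count` slow requests fell within `window` consecutive requests.
    All thresholds are illustrative assumptions, not vendor figures.
    """
    slow_positions = deque()  # indices of recent slow requests
    seen_slow = False
    for i, latency in enumerate(latencies_ms):
        if latency > threshold_ms:
            seen_slow = True
            slow_positions.append(i)
            # Forget slow requests that have fallen out of the window.
            while slow_positions and slow_positions[0] <= i - window:
                slow_positions.popleft()
            if len(slow_positions) >= repeat_count:
                return "repeating"
    return "isolated" if seen_slow else "ok"
```

A disk that returns "isolated" is probably worth a note in the log, while "repeating" is the case that deserves a closer look before the rate climbs further.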

Because of customer pressure, disk vendors do a lot of work to avoid returning an uncorrected error reply. As a result, latency can shoot up to very high values, even two orders of magnitude above normal (normal HDD latency averages around 500 ms and ranges up to 3 seconds, while an error recovery can take 30 seconds or more, depending on the disk and its configuration).

That said, it is worth comparing the disk to nearby disks. Disk latency can shoot up because of nearby noise or another external source, such as a minor quake or the operation of fire extinguishers. Comparing a disk to its nearby peers helps avoid false alarms caused by external factors that affected all disks at once.
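The peer comparison can be sketched roughly as follows. This is only an illustration: the function name, the disk names, and the factor of 10 are assumptions of mine, and the input is assumed to be one representative latency figure per disk (for example the worst latency seen in the last minute) collected however your system allows.

```python
from statistics import median


def suspect_disks(latency_by_disk_ms, ratio=10.0):
    """Return the disks whose latency stands out from their peers.

    A disk is a suspect when its latency is at least `ratio` times the
    median latency of the other disks. If every disk is slow at once
    (an external event), the peer median rises too and nothing is flagged.
    """
    suspects = []
    for disk, latency in latency_by_disk_ms.items():
        peers = [v for d, v in latency_by_disk_ms.items() if d != disk]
        if not peers:
            continue  # nothing to compare a lone disk against
        peer_median = median(peers)
        if peer_median > 0 and latency >= ratio * peer_median:
            suspects.append(disk)
    return sorted(suspects)


# Hypothetical readings in milliseconds; the disk names are made up.
one_bad_disk = {"sda": 12.0, "sdb": 15.0, "sdc": 30000.0, "sdd": 11.0}
all_disks_slow = {"sda": 30000.0, "sdb": 28000.0, "sdc": 31000.0}

print(suspect_disks(one_bad_disk))    # ['sdc']
print(suspect_disks(all_disks_slow))  # []
```

Using the median of the peers rather than the mean keeps a single misbehaving neighbor from hiding another one.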

This post was triggered by a post by Chris about his recent problems with disks.