Disk Survey

Surveying the disks of the world

Background Media Scan

The worst fear for a storage device is being unable to read the data when it is needed. RAID arrays have long performed disk scrubbing to preemptively detect problem areas and recover them from the RAID redundancy, before the problem shows up during a forced rebuild after an actual disk failure. Background Media Scan (BMS) is the equivalent for a single disk: an attempt to keep its own data in better shape.

The BMS process runs at idle time, when the disk is receiving no commands. A common setting is to start BMS after about 500ms without a command. The disk can still make progress on a loaded system, but the scan will take much longer. If the disk senses that it really needs to perform background tasks, it may reduce the required idle time to improve its chances, in which case it may use idle periods of around 100ms before going off to do its background work.

During BMS the disk goes sequentially over the media, reading the data with a slightly reduced ECC tolerance so that it can find locations that are having problems but are still readable. When it hits a problem spot there are two cases. If it can recover the data with the full ECC, it does so and attempts a rewrite to the same spot; it then verifies that it can re-read the data, and if it cannot, it declares the sector bad and performs a reallocation. If it cannot read the data at all it has no real recourse: it marks the sector as needing reallocation, and the sector is reallocated only if the next access to that location is a write rather than a read.
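The decision flow above can be sketched as a small simulation. This is only an illustration and not how any real firmware is written: the `Sector` model, the ECC budgets and the function names are all made up for the example, and it assumes the verify step after a rewrite uses the full ECC. Only the 500ms and 100ms idle thresholds come from the text.

```python
from dataclasses import dataclass
from enum import Enum

# Idle thresholds quoted in the text, in milliseconds.
NORMAL_IDLE_MS = 500
URGENT_IDLE_MS = 100

# Assumed, purely illustrative ECC budgets (correctable bit errors per sector).
REDUCED_ECC = 4
FULL_ECC = 10


class Outcome(Enum):
    HEALTHY = "healthy"          # readable even with the reduced ECC budget
    REWRITTEN = "rewritten"      # recovered with full ECC, rewrite verified
    REALLOCATED = "reallocated"  # recovered, but the rewrite did not verify
    PENDING = "pending"          # unreadable, waiting for a host write


@dataclass
class Sector:
    # Hypothetical model: bit errors seen on a read now, and the errors the
    # same physical spot would still show after being rewritten in place.
    errors: int
    errors_after_rewrite: int = 0
    pending: bool = False
    reallocated: bool = False


def may_start_scan(idle_ms: float, urgent: bool = False) -> bool:
    """The disk scans only after it has been idle long enough."""
    threshold = URGENT_IDLE_MS if urgent else NORMAL_IDLE_MS
    return idle_ms >= threshold


def scan_sector(s: Sector) -> Outcome:
    """Apply the BMS handling described above to one sector."""
    if s.errors <= REDUCED_ECC:
        return Outcome.HEALTHY
    if s.errors <= FULL_ECC:
        # Data is recoverable: rewrite in place, then verify by re-reading.
        s.errors = s.errors_after_rewrite
        if s.errors <= FULL_ECC:
            return Outcome.REWRITTEN
        # The spot is bad even after a rewrite: move the data to a spare.
        s.reallocated = True
        s.errors = 0
        return Outcome.REALLOCATED
    # Unreadable: nothing to recover, so only flag the sector.
    s.pending = True
    return Outcome.PENDING


def host_write(s: Sector) -> None:
    """A write to a pending sector is what triggers its reallocation."""
    if s.pending:
        s.reallocated = True
        s.pending = False
    s.errors = 0
```

Note that in this sketch a read of a pending sector changes nothing, which matches the point that only a write lets the disk reallocate an unreadable sector.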

In addition to recovering the data, the disk logs the sector and reports its status. This can be used to diagnose a disk and see whether its error rate is growing, in which case it can be ejected from the system. If the disk is part of a RAID, the RAID can be directed to rebuild the unreadable sectors from its redundancy and write them back to the same location, triggering the much-needed reallocation. This decreases the chance of hitting an unreadable sector after another disk fails, and so reduces the risk of data loss. Obviously, if the disk is not part of a RAID and a sector is unrecoverable, the data is pretty much lost.
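To make the RAID repair step concrete, here is a minimal sketch that assumes single XOR parity (RAID 5 style); the text does not say which RAID level is in use, and other levels rebuild differently. The names `xor_blocks` and `repair_sector` are hypothetical, and a stripe is modelled as a plain list holding one block per member disk, with `None` standing in for the sector the disk reported as unreadable.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def repair_sector(stripe, bad_index):
    """Rebuild the unreadable block from the surviving members of the stripe.

    With single parity, the missing block is the XOR of all the others.
    Storing the result back in the stripe models the write to the same
    location that lets the disk reallocate the sector.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != bad_index]
    if any(blk is None for blk in survivors):
        raise ValueError("more than one member unreadable; single parity cannot recover")
    stripe[bad_index] = xor_blocks(survivors)
    return stripe[bad_index]


# Usage: three data blocks plus their parity, with one block lost.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stripe = data + [xor_blocks(data)]
stripe[1] = None                      # sector reported unreadable by BMS
assert repair_sector(stripe, 1) == b"\x10\x20"
```

The `ValueError` branch is the failure case the paragraph warns about: if a second member of the same stripe is also unreadable, single parity has nothing left to rebuild from, which is why repairing early matters.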

One side effect of BMS is that the disk may be stuck in an error recovery procedure when the user wants it to service a new command. The error recovery procedure is not stopped midway, so this can cause high latency for the first command after an idle period.

All of the above applies to SAS drives. SATA is far less documented in this regard, but the standard mentions Continuous Background Defect Scanning (CBDS), which does pretty much the same thing with less reporting (SATA has no equivalent of the BMS log page) and less control.