Disk Survey

Surveying the disks of the world

Background Media Scan

The worst fear for a storage device is the inability to read the data when it is needed. RAID arrays have long performed disk scrubbing in order to preemptively detect areas that have problems and recover them from the RAID redundancy before the problem manifests during a forced rebuild caused by an actual disk failure. Background Media Scan (BMS) is the single-disk equivalent, an attempt by the drive to keep its data in better shape.

The BMS process works during idle time, when the disk receives no commands; a common setting is to start the scan after about 500 ms without any command. The disk may be able to perform this work in a loaded system, but it will take much longer. If the disk senses that it really needs to perform background tasks it may reduce the required idle time to gain a better chance of running them, using idle windows of around 100 ms before going off to do its background work. During the BMS the disk goes sequentially over the media, reading the data with a slightly reduced ECC tolerance so that it can find locations that are degrading but still readable. When it hits a problem spot it has two options. If it can recover the data with the full ECC it will do so, rewrite the same spot, and verify that it can re-read the data; if that verification fails it will declare the sector bad and perform a reallocation. If it cannot read the data at all it has no real recourse: it will mark the sector as needing reallocation, and if the next access to this location is a write rather than a read the sector will be reallocated.

In addition to recovering the data, the disk will log the sector and report its status. This can be used to diagnose a disk and see whether it starts to accumulate a higher error rate, in which case it can be ejected from the system. If the disk is part of a RAID it is possible to direct the RAID to recover the unreadable sectors from the RAID redundancy and write them back to the same locations, triggering the much-needed reallocation. This decreases the chance of hitting an unreadable sector during a rebuild after another disk failure, and thus reduces the risk of data loss. Obviously, if the disk is not part of a RAID and there is an unrecoverable sector, the data is pretty much lost.
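For SAS drives the results of this scan are exposed through the Background scan results log page (page 0x15 in SPC). As a rough illustration, the following Python sketch shells out to sg_logs from sg3_utils to dump that page for a list of devices; the exact invocation and output format depend on your sg3_utils version, so treat this as a starting point rather than a finished tool.

    #!/usr/bin/env python3
    """Sketch: dump the SAS Background scan results log page (0x15) for a
    set of disks using sg_logs from sg3_utils.  The page code and the
    sg_logs invocation are assumptions based on SPC; verify them against
    your sg3_utils version before relying on this."""

    import subprocess
    import sys

    def bms_log(dev):
        # sg_logs -p 0x15 <dev> should print the background scan status
        # and the medium scan entries the drive has recorded.
        out = subprocess.run(["sg_logs", "-p", "0x15", dev],
                             capture_output=True, text=True)
        if out.returncode != 0:
            return None  # non-SAS device or page not supported
        return out.stdout

    if __name__ == "__main__":
        for dev in sys.argv[1:]:          # e.g. /dev/sg0 /dev/sg1 ...
            text = bms_log(dev)
            if text is None:
                print(f"{dev}: no background scan log page")
            else:
                print(f"--- {dev} ---\n{text}")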

One side effect of the BMS is that the disk may get stuck in an error recovery procedure just as the user wants to service a new command. The error recovery procedure is not stopped mid-work, so this can cause a high latency for the first command after an idle period.

The above all applies to SAS drives. SATA is far less documented in this regard, but the standard mentions a Continuous Background Defect Scanning (CBDS) that does pretty much the same thing, with less reporting (there is no equivalent of the BMS log page in SATA) and less control.

Drive Drops Off the RAID

A common complaint about drive failures in RAIDs is that a drive dropped off the RAID for no apparent reason; the drive is then tested, seems functional, gets added back to the RAID and works just fine with no immediate issue. This may repeat with the same drive every once in a while and is quite a bother, since the RAID array now needs more maintenance and goes into a degraded mode for what seems to be no real reason.

While it is impossible to discount a bug in the RAID software or firmware, it is more than likely that the fault lies with the disk itself. The disk is probably having problems with some locations on its media, and when it tries to perform error recovery the retries can take a long time; the result is that the RAID decides to drop the disk from the array for being unresponsive. The reason for this decision is that the RAID has received IOs, passed them to the disk, and the disk has not yet responded; this means the RAID itself cannot respond to the external IO and risks causing problems upstream. The simplest thing the RAID can do is drop the disk from the array, go into degraded mode, and recover the lost data from the RAID redundancy.

The result is that the disk gains some quiet time, which lets it run whatever background process it can to recover the bad media location. When it is actually tested later on it seems to work very well, and there is no apparent reason not to add it back to the RAID.

The risk of adding the disk back into the RAID is that if another device in that RAID develops a failure, and this semi-problematic disk drops off again shortly before that other failure manifests itself, you risk data loss or a severe service loss from a double failure. You also increase the risk of hitting another media error while reading the data from the RAID redundancy during the degraded mode caused by this disk's failure.

In most cases a disk that begins to exhibit such problems is on a declining trend and is likely to fail sooner rather than later; it is best to remove it from the RAID and possibly use it only in an unimportant role.

What a strong RAID could do, instead of just dropping the disk, is keep it partially in the RAID and still let it rest: perhaps deliver only writes to it, or mark as bad the blocks around the area where the problem manifests. That requires a good deal of record keeping and guesswork, which makes it harder to get right, but it is doable.

One thing that disk drive makers do to help disks survive in a RAID is to limit the time the drive will spend on error recovery. Each maker uses a different name and acronym for this feature: TLER, ERC, or CCTL.
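On SATA drives this knob is commonly exposed as SCT Error Recovery Control and can be set with smartctl. Below is a hedged Python sketch that wraps the smartctl -l scterc,READ,WRITE invocation; the 7-second value and the device list are placeholders, and drives that lack the feature will simply report failure.

    #!/usr/bin/env python3
    """Sketch: cap drive error recovery time (SCT ERC, the SATA analogue
    of TLER/CCTL) using smartctl.  The 'scterc,70,70' argument means 7.0
    seconds for reads and writes, in units of 100 ms; not every drive
    supports it, so treat a failure as "feature unavailable"."""

    import subprocess
    import sys

    def set_erc(dev, deciseconds=70):
        arg = f"scterc,{deciseconds},{deciseconds}"
        res = subprocess.run(["smartctl", "-l", arg, dev],
                             capture_output=True, text=True)
        return res.returncode == 0

    if __name__ == "__main__":
        for dev in sys.argv[1:]:          # e.g. /dev/sda /dev/sdb ...
            ok = set_erc(dev)
            print(f"{dev}: ERC {'set to 7.0s' if ok else 'not supported or failed'}")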

How Micron Makes Their SSDs

An interesting ad that shows the steps in the manufacture of an SSD from flash chip to completed product:

Crucial® SSDs are designed and developed by Micron, one of the largest NAND manufacturers in the world. This means four things: hundreds of SSD qualification tests, over a thousand hours of prerelease validation testing, 1.5 billion dollars invested in R&D, and more than 30 years of industry expertise. For more information, go to www.crucial.com

What to Ask About an SSD?

The following is a list of questions that I have in mind when considering an SSD. The performance of an SSD is important, and there are good and bad SSDs in that regard, but the part that worries me is the behind-the-scenes RAS (Reliability, Availability, Serviceability) issues.

Data Path

  1. How fast is it in read and write?
  2. What are the internal data-path design details that will enable getting the maximal parallelism from the disk?
  3. How many IOs in parallel can be maintained? (queue depth)
  4. What is the smallest optimal block size and what is the alignment? Where does the limitation come from?
  5. What should be the optimal trim usage?
  6. What IO request size is atomic?
  7. What is the variability in response times? What are the causes for the variability?

RAS

Some of the RAS questions are on the border of the data-path, but they cover things that application developers tend to ignore.

  1. What errors might be returned? In what conditions? What needs to be done for each error?
  2. What are other hard failure modes that are not covered by simple error sense codes? Drive failing to start? Drive asserts?
  3. What is the error recovery procedure of the disk?
  4. What is the smallest unit of work for ERP?
  5. How long can the ERP take?
  6. What is the expected distribution for ERP times?
  7. What is the impact of an ERP action on other IOs? (delay until all ERP is finished? prevent parallelism?)
  8. What happens on interface errors (initiator disappeared)?
  9. How long do the capacitors hold the SSD?
  10. How long does it take for them to get charged enough?
  11. How is it possible to know that a hold-up is due to the capacitors charging?
  12. How and when are the capacitors tested during operation? How do we know they were tested ok or failed?
  13. How is the NAND evaluated during operation? When does a NAND block get retired?
  14. Is there an internal RAID structure to recover from a dead NAND block? What will be the ERP impact of such a recovery?
  15. What is the SMART logic based on? When will a SMART trip be triggered?
  16. Is the SMART trip fatal or are there warning trips as well?
  17. How often should we poll for the SMART information? Are there any notifications from the disk about a change in SMART? (A polling sketch follows this list.)
  18. What is the impact of a SMART request or a LOG SENSE request? Anything that might take longer? Prevent normal data-path accesses in parallel?
  19. What information to collect when a failure is detected? Any vendor specific logs that may help debug the issue?
  20. What sort of recovery will happen if power is lost unexpectedly? How long will it take? How will we know that it is taking place? And what will happen if it fails?
  21. What sort of background activities happen on the disk?
  22. What is the impact of the background activities on latency? Are they stopped when a user IO is received or will they delay the user IO?
  23. How can the background activities be monitored?
  24. How can I inject errors into the drive to test the whole hardware and software stack? Should be able to inject media errors, hardware errors and latency as well as a mix of them (media with latency).
  25. What is the data protection inside the device? Needed to avoid corruption of data inside the disk.
  26. What are the external facilities for data protection? (T10-DIF? 520-byte sectors?)
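For question 17 above, a minimal polling loop is sketched below. It shells out to smartctl -A and diffs the raw attribute values between polls; the table parsing is a convenience for a prototype (smartctl's JSON output would be more robust where available), and the device path and polling interval are arbitrary placeholders.

    #!/usr/bin/env python3
    """Sketch for question 17: poll SMART attributes and report changes.
    Parses the table printed by 'smartctl -A'; good enough for a
    monitoring prototype, not a formal interface."""

    import subprocess
    import time

    POLL_SECONDS = 600  # polling too often has its own cost (question 18)

    def read_attrs(dev):
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        attrs = {}
        for line in out.splitlines():
            parts = line.split()
            # Attribute rows start with a numeric ID; the raw value is
            # everything from the tenth column onward.
            if len(parts) >= 10 and parts[0].isdigit():
                attrs[parts[1]] = " ".join(parts[9:])
        return attrs

    def monitor(dev):
        prev = read_attrs(dev)
        while True:
            time.sleep(POLL_SECONDS)
            cur = read_attrs(dev)
            for name, raw in cur.items():
                if prev.get(name) != raw:
                    print(f"{dev}: {name} changed {prev.get(name)} -> {raw}")
            prev = cur

    if __name__ == "__main__":
        monitor("/dev/sda")  # hypothetical device path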

Disk Latency Is What Really Matters

The best indication of a disk problem brewing or already happening is the latency of disk requests. Normally a disk has fairly consistent, though noisy, latency figures that depend on many factors, with a general tendency to stay in some ballpark that is the normal behavior. A disk with a small problem will show a once-in-a-long-while higher latency that indicates a random problem. If, however, the disk exhibits a repeating pattern of high-latency requests, it is definitely an indication of a problem with the disk. Oftentimes the problem starts at a very low rate and increases until a complete breakdown of the disk. Depending on the usage model it may be a good idea to catch this early on, and in any case, even if the decision is made to keep the disk in place until it actually dies, it is worth monitoring the overall system situation: if several disks in the same RAID group exhibit problems, you may find yourself facing data loss from a multiple-disk failure.

Due to customer pressure, the disk vendors do a lot of work to avoid returning an uncorrected error reply. As a result the latency numbers may shoot up to very high values, even two orders of magnitude above the normal latency (a normal HDD latency averages around 500 msec and can reach up to 3 seconds, while an error recovery could take 30 seconds or more depending on the disk and its configuration).

That said, it is worth comparing the disk to nearby disks. Disk latency can shoot up due to nearby noise or another external source, such as a minor quake or the operation of fire extinguishers. By comparing a disk to its nearby peers it is possible to avoid false alarms based on external factors that affect all disks at once.
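A minimal sketch of that peer comparison is below, assuming per-disk latency samples are already being collected by some other means (blktrace, block-layer statistics, or the RAID itself); the 3x-the-peer-median rule is only an illustrative threshold.

    #!/usr/bin/env python3
    """Sketch: flag a disk whose recent latency stands out from its peers.
    The latency samples are assumed to come from elsewhere; the 3x factor
    is an illustrative choice, not a recommendation."""

    from statistics import median

    def suspicious_disks(latency_ms):
        """latency_ms: dict of disk name -> list of recent latency samples (ms)."""
        per_disk = {d: median(s) for d, s in latency_ms.items() if s}
        overall = median(per_disk.values())
        # A spike shared by all disks (vibration, an external event) raises
        # the overall median too, so disks are only flagged relative to peers.
        return [d for d, m in per_disk.items() if m > 3 * overall]

    if __name__ == "__main__":
        samples = {
            "sda": [6.1, 5.8, 7.0, 6.4],
            "sdb": [6.3, 5.9, 6.8, 6.6],
            "sdc": [6.0, 150.0, 6.2, 180.0],   # repeating high-latency pattern
        }
        print(suspicious_disks(samples))      # -> ['sdc']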

This post was triggered by a post by Chris about his recent problems with disks.

Disk Question: How Many Bad Sectors Means a Bad Disk?

This is mostly a question to answer with the disk survey, but I do wonder how many bad sectors mean that the disk is very likely to go bad soon. Obviously one needs to define what “likely”, “soon” and “bad” mean to have a useful answer, and these definitions depend on the user. For a desktop user who cares about his data I’d say the threshold should be a 30% chance that the disk will either SMART-trip or have an unrecoverable sector in the next 3 months. For a server system with RAID I’d say the threshold would be an 80% chance of the same problems in the next month.
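While waiting for real survey data, a crude version of such a check can be sketched as below: read the reallocated and pending-sector counters with smartctl and compare them to per-deployment thresholds. The threshold values in the sketch are placeholders, not the answer this post is asking for.

    #!/usr/bin/env python3
    """Sketch of the policy question in this post: read the reallocation
    and pending-sector SMART counters and compare them against
    per-deployment thresholds.  The thresholds below are placeholders."""

    import subprocess

    WATCH = {"Reallocated_Sector_Ct": 10, "Current_Pending_Sector": 1}  # placeholders

    def counters(dev):
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        vals = {}
        for line in out.splitlines():
            parts = line.split()
            # smartctl -A table: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
            if len(parts) >= 10 and parts[1] in WATCH:
                try:
                    vals[parts[1]] = int(parts[9])
                except ValueError:
                    pass
        return vals

    def looks_risky(dev):
        vals = counters(dev)
        return any(vals.get(name, 0) >= limit for name, limit in WATCH.items())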

I really should get going with the implementation of the disk survey to start getting the data for this.

How Storage Layers Hurt

There has always been a war between the layer makers and the monolithic lovers. It is ever present in the networking model, with its multitude of specifically defined layers, and it also exists in the storage world, whether the storage is networked or not. There is a lot of simplicity in the upper layers when the lower layers take control and do something behind the scenes. The problem comes when the lower layers do things without being aware of what could be done in a higher layer, and there is no way to communicate this information between the layers.

Take for example a presentation by Micron about their ClearNAND, titled “Why ECC-Free NAND Is the Best Solution for High-Performance Applications”. It describes a small NAND controller below the SSD controller that handles the ECC by itself, and thus reduces the need for ever larger ECC handling in the FPGA/ASIC of the SSD’s top controller. The first thought that comes to mind is: how can smart handling of slightly deteriorated NAND be performed if the lower level hides the information? How would something like the Anobit (RIP) or DensBits smarts come to life if all the upper layer can get is either good data or forever-corrupt data?

The same happens at the higher layers. A typical RAID device has multiple devices with redundancy and is capable of recovering from errors in less time than it would take a disk to retry, yet there is no method to communicate the problem from the disk (SSD or HDD, it doesn’t matter) and let the upper RAID level handle it, returning to the disk with a request to do more work only if the higher-level recovery failed. Performance of storage devices would increase even in the face of media problems, and the world would be better for it. But the laziness of some software developers at the higher layers and the inflexibility of the developers at the lower layers have prevented it so far. The defined interfaces between a host and the disk do not help much either.

There is a lot to be said for the ultimate control of an integration such as Fusion-IO does with their products, working their way from the NAND up to the top application. And yet there is a world of difference between a caching product and a full-blown SAN storage device that makes the life of administrators much easier.

There should also be a middle ground; I wonder if it will ever come to life?

RAID Best Practice: Write After a Long Latency Read

As discussed in the disk error recovery procedure page, when a disk performs marginal error recovery work it may not result in a check condition with sense data; it may only be noticeable through a long latency on the read. The proposed handling for this is either to ignore it or to rewrite the data at the same spot. The thinking is similar to the other best practice of writing over a medium error: if there is a problem at that spot, a rewrite will allow the disk to fix the problem by writing over it or by reallocating the sector to another place that will hold the data better.

A long latency in this case may also include queuing delays, since the disk may be free to reorder requests. Dealing with this reordering is non-trivial, and in different cases I have witnessed queuing delays of two to three seconds due to reordering.

There are a few caveats though to this practice:

The IO that returned with a relatively long latency may not have been delayed by an error recovery procedure at all; it could have been mere queuing. This can be handled by measuring the time of the IO not from the time of submission but from the time the previous IO returned, which gives a number much closer to the actual IO service time.
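A sketch of that completion-to-completion measurement is below, with an illustrative 1-second threshold for what counts as a long latency:

    #!/usr/bin/env python3
    """Sketch of the completion-to-completion estimate described above:
    time an IO from the previous completion on the same disk rather than
    from its own submission, so queuing delay is mostly factored out.
    The 1-second threshold is an illustrative value."""

    import time

    LONG_LATENCY_SEC = 1.0  # illustrative threshold, tune per drive and config

    class ServiceTimeTracker:
        def __init__(self):
            self.last_completion = None

        def on_completion(self, submitted_at):
            now = time.monotonic()
            if self.last_completion is None or submitted_at > self.last_completion:
                # Disk was idle (or the queue drained); fall back to the
                # plain submit-to-complete time.
                estimate = now - submitted_at
            else:
                # Queue was busy: measure from the previous completion instead.
                estimate = now - self.last_completion
            self.last_completion = now
            return estimate, estimate > LONG_LATENCY_SEC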

Another possible cause is a background process that took over, needed a lengthy error recovery, and only then was this IO serviced. Unfortunately there is no real way to discern this. On a SAS disk it is possible to query the Background Media Scan log page to see if something happened recently, but it is not a sure way and the LOG SENSE request takes time as well.

Despite the drawbacks, this should be a very useful method to reduce the number of unrecoverable read errors by fixing problem spots before they become too much of a problem.

As with any possible error, it is also important to log this and keep statistics about such work, in order to enable higher-order analysis of whether a disk is going bad.

RAID Best Practice: Write Over a Media Error

When a disk reports a media error on a read it merely means that the read failed; it is hard to judge from that alone whether the location is actually bad or the media has deteriorated at this location. This is what the Hitachi Ultrastar 5K3000 spec has to say about it:

9.9.1 Auto Reassign Function

Non recovered read errors

When a read operation is failed after defined ERP is fully carried out, a hard error is reported to the host system. This location is registered internally as a candidate for the reallocation. When a registered location is specified as a target of a write operation, a sequence of media verification is performed automatically. When the result of this verification meets the criteria, this sector is reallocated.

Hence the right thing to do is to recover the data from the RAID parity and write it over. The disk is then tasked with writing the location and verifying that the write worked; if it did, we are good, and if not, the disk will reallocate the data at that location, moving it to one of the spare locations, which will show up as an increase in the number of reallocations.

It would also be a good idea to use a WRITE AND VERIFY command to enforce and be sure of a proper write in this case, as this is already a suspect location.
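A sketch of the overall practice for a parity-based (RAID-5-like) layout is below; read_block and write_block are hypothetical stand-ins for the array's own IO path, and a real implementation would likely want the verifying write mentioned above.

    #!/usr/bin/env python3
    """Sketch of the rewrite-over-media-error practice for a RAID-5-like
    layout: rebuild the unreadable block by XOR of the surviving stripe
    members and write it back over the failed location.  read_block and
    write_block are hypothetical helpers for the array's own IO path."""

    from functools import reduce

    def rebuild_and_rewrite(stripe_members, failed_member, lba, block_size,
                            read_block, write_block):
        # XOR the same block from every surviving member of the stripe.
        peers = [read_block(m, lba, block_size)
                 for m in stripe_members if m != failed_member]
        data = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), peers)
        # Writing over the bad spot lets the drive verify the location and
        # reallocate the sector if the verification fails.
        write_block(failed_member, lba, data)
        return data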

 

Paper: Google’s Failure Trends in a Large Disk Drive Population

Google’s seminal paper “Failure Trends in a Large Disk Drive Population” has shown everyone that what was widely believed beforehand is not necessarily true. For the impatient, a short recap of the paper is available from StorageMojo (“Google’s Disk Failure Experience”). The paper was the nail in the coffin of the myth that SMART will help detect near-failing disks and save us from the multiple-disk failure hazard of RAID. It also showed the way that needs to be followed: collecting real data from enough systems to make it possible to draw real conclusions.

Unfortunately, there has been very little follow-up. In my work I hope to step into Google’s large shoes and make another attempt, this time an open one, to collect the data and make it widely available for learning.