Disk Survey

Surveying the disks of the world

Decoding LSI LogInfo Codes

The common LSI SAS/SATA adapters report an error code when an IO is failed by the HBA itself. These error codes are cryptic by nature but do carry some useful information if one can decode them; for an example you can refer to my post on LSI LogInfo 0x31080000. Unfortunately LSI does not publicly document how to decode these error codes and what they mean, the only open information is what can be found in the Linux kernel drivers, and from that I built a tool to decode the LSI LogInfo.

The decoding is still pretty cryptic and looks like this:

./lsi_decode_loginfo 0x31080000

Type:       30000000h   SAS 
Origin:     01000000h   PL 

But it is far better than the numeric code itself.
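
The split itself is straightforward: the 32-bit value packs a bus type, an originator, a code and a sub-code, which is also how the mpt2sas driver formats its own log_info messages. Here is a minimal sketch of that split in Python (the field layout is taken from the Linux driver sources; the meaning of each value still needs the lookup tables in the script):

    def split_loginfo(loginfo):
        # Field layout as used by the mpt2sas/mpt3sas drivers:
        #   bits 31-28: bus type (3 = SAS)
        #   bits 27-24: originator (for SAS: 0 = IOP, 1 = PL, 2 = IR)
        #   bits 23-16: code
        #   bits 15-0 : sub-code
        return {
            "bus_type":   (loginfo >> 28) & 0xF,
            "originator": (loginfo >> 24) & 0xF,
            "code":       (loginfo >> 16) & 0xFF,
            "sub_code":   loginfo & 0xFFFF,
        }

    print(split_loginfo(0x31080000))
    # {'bus_type': 3, 'originator': 1, 'code': 8, 'sub_code': 0}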

You can find the script at https://github.com/baruch/lsi_decode_loginfo and you can download just the lsi_decode_loginfo.py script file to run wherever you need it. It only needs basic Python.

You can also see all the error codes decoded at LSI LogInfo Decodes.

Disk Error Recovery: Attempting Task Abort

When a command is sent to a SCSI device (HDD, SSD or even a SAN) the host also sets a timeout. It is customary to set this timeout to 30 seconds, and that is likely what you are using unless you set it differently or the application you use does something unusual. The host tracks the command from the moment it is sent to the device until it returns; if the command takes longer than the set timeout the host needs to perform error recovery.
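
On Linux you can see (and change) this per-device timeout in sysfs; a quick check of what your system actually uses, assuming the disk is sda:

    from pathlib import Path

    # Per-device SCSI command timeout in seconds; 30 is the usual default.
    print(Path("/sys/block/sda/device/timeout").read_text().strip())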

Unfortunately the host has no way to know whether the command was lost in transit and never seen by the device, whether the device has failed completely and will never reply, or whether the device is simply taking a very long time to handle this command. It is also possible that a previous command sent to the device is taking a long time and the command we are currently trying to decide about is still waiting to be executed.

At this point the host has only a few things it can do, and they are all hammers: the first one is small but the others are heavy. The options are:

  • Task Abort
  • LUN Reset
  • Target Reset
  • Host Reset

The surgical hammer is the task abort, and in my experience it either does the job well enough or the escalation quickly reaches the heavier hammers without any real resolution of the issue, which is why I always try to configure things so that the heavier hammers are never reached.

  • Task abort has its own timeout
  • Devices do not really abort anything unless the command is still pending in the queue
  • If a command is already executing, a task abort will only wait for it to complete and will not return the result
  • If it is a read there may be exit points to cancel the command mid-work instead of waiting forever
  • The task abort timeout needs to be set very high to avoid reaching for the heavier hammers

If your device shows task aborts, here is what to look for:

  • Link errors (sg_logs, sg_ses, counters on the host); see the collection sketch after this list
  • Medium errors (disk scan); use single IOs to prevent queueing trains
  • Firmware problems (consult the vendor if the above didn't help)
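
A quick way to pull the counters mentioned above, assuming sg3_utils is installed and the disk is /dev/sg1 (a hypothetical device path, adjust to your setup); a real monitor would parse the output rather than just print it:

    import subprocess

    DEV = "/dev/sg1"   # hypothetical device path

    # Standard SCSI log pages: 0x03 = read error counters,
    # 0x18 = protocol specific port page (SAS link error counters).
    for page in ("0x03", "0x18"):
        print("=== log page", page, "===")
        subprocess.run(["sg_logs", "--page=" + page, DEV], check=False)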

Making Sense of SCSI Sense

To make sense of SCSI sense I created a knowledge base article, Understanding SCSI Sense Decoding, and in addition I created a web-app to quickly decode the SCSI sense buffer.

One last piece is to actually get the SCSI sense buffer; if you are using Linux you can capture it with SystemTap using this STP script:

    probe module("scsi_mod").function("scsi_command_normalize_sense") {
        for (i = 0; i < $cmd->cmd_len; i++) {
            printf(" %02X", $cmd->cmnd[i]);
        printf(" | SENSE:");
        for (i = 0; i < 32; i++) {
            printf(" %02X", $cmd->sense_buffer[i]);

The command to run it is as simple as:

stap scsi_sense.stp

This will output something like:

CDB: 4D 00 40 00 00 00 00 00 04 00 | SENSE: 70 00 05 00 00 00 00 0A 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CDB: 4D 00 40 00 00 00 00 00 04 00 | SENSE: 70 00 05 00 00 00 00 0A 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CDB: A1 08 2E 00 01 00 00 00 00 EC 00 00 | SENSE: 72 01 00 1D 00 00 00 0E 09 0C 00 00 00 00 00 00 00 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00
CDB: A1 08 2E 00 01 00 00 00 00 EC 00 00 | SENSE: 72 01 00 1D 00 00 00 0E 09 0C 00 00 00 00 00 00 00 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00
CDB: 85 08 2E 00 00 00 01 00 00 00 00 00 00 00 EC 00 | SENSE: 72 01 00 1D 00 00 00 0E 09 0C 00 00 00 00 00 00 00 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00
CDB: 85 08 2E 00 00 00 01 00 00 00 00 00 00 00 EC 00 | SENSE: 72 01 00 1D 00 00 00 0E 09 0C 00 00 00 00 00 00 00 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00
CDB: 85 08 2E 00 00 00 01 00 00 00 00 00 00 00 EC 00 | SENSE: 72 01 00 1D 00 00 00 0E 09 0C 00 00 00 00 00 00 00 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00
CDB: 85 08 2E 00 00 00 01 00 00 00 00 00 00 00 EC 00 | SENSE: 72 01 00 1D 00 00 00 0E 09 0C 00 00 00 00 00 00 00 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00

This shows, for each error, the CDB of the command and the sense buffer. It looks like each case is shown twice since the Linux kernel calls scsi_command_normalize_sense twice for each error, but that's a small pain.
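
If you want to pull the key fields out of such a dump by hand, the interesting bytes sit at fixed offsets. A small sketch that extracts the sense key, ASC and ASCQ from one of the hex strings above, handling both the fixed (70h/71h) and descriptor (72h/73h) formats:

    def parse_sense(hexdump):
        b = bytes.fromhex(hexdump)
        resp = b[0] & 0x7F
        if resp in (0x70, 0x71):      # fixed format sense data
            return resp, b[2] & 0x0F, b[12], b[13]
        if resp in (0x72, 0x73):      # descriptor format sense data
            return resp, b[1] & 0x0F, b[2], b[3]
        raise ValueError("not a recognizable SCSI sense buffer")

    # First line above: sense key 5 (ILLEGAL REQUEST), ASC/ASCQ 20h/00h
    print(parse_sense("70 00 05 00 00 00 00 0A 00 00 00 00 20 00 00 00"
                      " 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"))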

SATA Handling of Medium Errors: Log_info(0x31080000)

When running SATA disks behind an LSI SAS controller one may encounter an obscure error report in the kernel that says “mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)”. If you have more than one SAS controller it may also say mpt2sas1, mpt2sas2 or higher. The most common errors of this form that the SAS controller emits are about bad cables or bad ports, but this specific one is actually not about bad hardware, at least not bad SAS or SATA hardware.

This particular error is a side effect of the inability of the SATA NCQ protocol to report which specific IO had a problem. In a proper SCSI environment such as SAS disks, the disk can and does report an error against the specific IO that failed and can continue to handle the other outstanding IOs normally. SATA NCQ however is unable to do that: once there is any error, most commonly a medium error, the disk aborts all the other IOs pending to it and they need to be reissued by the OS after it has handled the failed IO request.

The result is that with NCQ there is a severe performance impact from this error recovery pattern: not only did the user wait a long time to learn about the medium error (invariably a medium error is the result of some internal timeout), all the other pending requests were aborted and need to be reissued, which means wasted time in which the disk could have handled more requests.

If the disk is in a proper RAID system the RAID logic will regenerate the data from the parity and rewrite the offending location in order to correct this. If the RAID is not that smart you may want to consider removing the disk from the RAID group to force a rebuild and then reinsert the disk, preferably after rewriting the entire disk surface and making sure the disk is still fine. It most often will still work just fine after a proper scrub.

It is rather unfortunate that the LSI log_info decoding guide is not provided freely, but some hints can be gleaned by looking at the source of the mptbase, mpt2sas and mpt3sas drivers.

Limit Maximum Latency of Multiple Command Queueing

One side effect of SATA NCQ and SCSI TCQ is that some commands may be kept in the queue for an extended period of time without being serviced. The disk optimizes for the closest accesses, and if there is a single request at one side of the disk while all the other requests keep coming in at the other side, that lonesome request will not be serviced for a long time. The disks have a time limit that can be set and even enabled or disabled, but unfortunately this is a non-standard feature and each disk may implement it differently. Hitachi has it in mode page 00, named “Command Aging”; the Ultrastar 7K4000 SAS spec documents the default value to be 2.4 seconds.

This means that for a naive user of the disk some requests may take upwards of 2 seconds and the process which depends on these requests will be starved. An additional point is that once we have sent a command to the disk it is completely out of the OS's control: it cannot be pulled out and another placed instead.

A simple and yet effective technique to deal with this is documented in the thesis “Native Command Queuing and Latency” by Lars Sondergaard Pedersen. In it he shows that by simply keeping a deadline on the requests and not placing new requests once the deadline for some outstanding request has expired, we can force the disk to honor a lower request timeout, even if we do not know or cannot change the Command Aging timeout of the disk.
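
The idea is easy to express in code: track your own deadline for every outstanding request and stop feeding the disk as soon as the oldest one is overdue. A rough sketch of such a submission gate (the submit callable and request ids are placeholders, not a real IO API):

    import time

    DEADLINE = 0.5   # seconds we are willing to let any single request wait at the disk

    class ThrottledQueue:
        """Hold back new requests while the oldest outstanding one is overdue."""

        def __init__(self, submit):
            self.submit = submit      # callable that actually issues an IO to the disk
            self.outstanding = {}     # request id -> submission time

        def try_submit(self, req_id, req):
            oldest = min(self.outstanding.values(), default=None)
            if oldest is not None and time.monotonic() - oldest > DEADLINE:
                return False          # back off until the disk finishes the overdue IO
            self.outstanding[req_id] = time.monotonic()
            self.submit(req)
            return True

        def completed(self, req_id):
            self.outstanding.pop(req_id, None)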

This obviously affects maximum throughput, but you may often need to let multiple different processes get their fair share of the disk IO and this may be a viable way to do it.

The deadline scheduler seems to do something relevant to this, but while it is mentioned in the thesis there is no discussion on why the deadline scheduler itself is insufficient to solve the problem.

NCQ Disabled, What Is It About?

Sometimes you can find in your logs a line similar to this:

ata3.00: NCQ disabled due to excessive errors

and you are left wondering whether this is good, bad or even warranted.

What is NCQ?

NCQ is the name for the disk feature that allows the disk to accept multiple commands in parallel, reorder them and return a response to each IO not in the order they came but in the order the disk thinks is best from a performance perspective. In the old days the disk could service only one command at a time, and in fact even today, at the very lowest level, the disk can only do one IO at a time. A lot of work used to be done to improve performance by giving the disk the optimal order, but with the advances in technology and the increased complexity, operating systems and applications went out of sync with the disks and can no longer do a good enough job. Instead the task is given to the disk itself, since only it is aware of its own internal geometries and redirections. Not without fault, but it is far simpler to adapt the disk to its new technologies than the myriad of applications.

The end result is that the performance can be near optimal. The disk queue is fairly small, at only 31 requests for a SATA disk, so the OS still does its own reordering and controls the order in which IO requests are given to the disk, but the disk can still do the best job within its constraints.
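
You can see this limit from the OS side: libata exposes the negotiated queue depth per device, and it drops to 1 once NCQ is turned off (assuming the disk is sda):

    from pathlib import Path

    # 31 while NCQ is active on a SATA disk, 1 once the kernel disables NCQ.
    print(Path("/sys/block/sda/device/queue_depth").read_text().strip())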

What can be the reason to disable it?

NCQ didn’t come without its own set of problems. Initially not all disks supported it well, some failed in certain conditions when used with NCQ, and workarounds were needed. Linux implemented a three-strikes policy where if the disk accumulates three such errors over the OS uptime it will drop NCQ, due to the niggling suspicion that NCQ may be the culprit. This is a harsh judgement these days as most devices work very well with NCQ and only fringe devices may fail with it.

One incorrect signal that may be taken to trigger the NCQ disabled message is disk timeouts: if a disk is slightly misbehaving it may stall for far longer than the 30 seconds allotted to all IOs by default, and in that case a single failure will count towards the NCQ disabling punishment. If the IO request is retried against the same location, NCQ will get disabled for no good reason, as it is not NCQ that is to blame.

The underlying problem is that if there is a problem with the NCQ implementation on the drive, then the way it will manifest itself is by the IO never coming back, which shows up as a timeout just as if the disk had a media problem that required many retries to succeed.

What can be the impact of this kernel action?

The main impact of disabling NCQ is performance: on my disks the throughput dropped from around 20MB/s to 3MB/s, which is an abysmal rate. It will be pretty painful to continue using the disk in this mode, and it is advisable to try to avoid this punishment if possible.

What to do to avoid this error?

The main cause I’m aware of is a media problem on the disk, which can lead to timeouts since a SATA disk can take upwards of 30 seconds retrying to get the data before it gives up. In fact, some disks have an essentially infinite timeout since their only consideration is to recover the data if at all possible; it is assumed that a SATA disk is used in a non-RAID configuration and that losing the data is far worse than waiting a very long time for it. Some disks have a TLER/ERC/CCTL option that can limit the unbounded recovery time to a manageable value, and it is worth checking if your disk has it. To check the TLER setting on your drive:

smartctl -l scterc /dev/sda

To control the TLER setting (the values are the read and write recovery limits in tenths of a second, so 250 means 25 seconds):

smartctl -l scterc,250,250 /dev/sda

If you actually want to get the data, and also in general, my recommendation is to increase the command timeouts for the disks. This can be achieved by setting the timeout value in sysfs for each disk, easiest done by running a command:

find /sys -path '*:*:*:*/timeout' | while read d; do echo 300 > $d; done

This will set the timeout to 5 minutes (300 seconds).

It is also possible to make this the default for any disk and at boot by adding a file to /etc/udev/rules.d with the content:

ACTION=="add", SUBSYSTEM=="scsi", DRIVER=="sd", PROGRAM="/bin/sh -c 'echo 300 > /sys/$devpath/timeout'"

What can be done immediately?

In the immediate term, if the disk had its NCQ disabled you should be able to set the queue depth again; according to a quick read of the kernel sources this resets the NCQ_OFF flag and will bring back the performance. It is suggested that you also increase the timeout as above to avoid this happening again.

find /sys -path '*:*:*:*/queue_depth' | while read d; do echo 31 > $d; done

We use 31 rather than 32 due to a libata decision that limits the usable queue depth.

Tales From an Adventure With Failing Disks

I was going through old archive disks of mine trying to recover lost treasures from a past life and load them onto a new storage system I installed at home. In that journey I hit exactly the issues I’ve been thinking about and working on when dealing with disks.

All of these disks exhibited one or more problems, all of them media issues, and all of the important data I needed was recovered. There may have been areas of the disks that were not recoverable, but that did not impact my work.

The first few disks were old (from 2000 through 2003) IDE disks that I connected through an IDE-to-USB adapter. They had trouble reading the data, reported unrecovered read errors and initially failed to fetch the data. Based on my knowledge of disk error recovery I increased the timeout of the disks to 5 minutes by tweaking the appropriate setting on the block device in Linux. A retry of the read brought the data; it took time to get all of it since the reads took up to a minute or two in some cases, but the data was recovered, which was the main concern.

The last disk was a relatively new (2010) Western Digital Green drive of 2TB which I used as a sort of archival data store, and it had a lot more problems. Its latencies were off the chart and I increased the timeout to 10 minutes. This time there were also some unrecoverable read errors, but those few were in unimportant spots; they were spotted when I used diskscan and corrected with the --fix option of diskscan. After a sweep of diskscan every spot that previously had a high latency had come back down to normal and the entire disk drive became usable again. The disk self-test which previously failed passed again, and I lost the chance to experience an RMA process.

All in all, diskscan did its work and the increase of the allowed disk timeouts got me my data.

I’m now using a udev rule to set the timeout on all my disks to 5 minutes by default:

ACTION=="add", SUBSYSTEM=="scsi", DRIVER=="sd", PROGRAM="/bin/sh -c 'echo 300 > /sys/$devpath/timeout'"

Disk Surface Scan on Linux and Unix

Today I released a disk surface scan tool for Linux; I’ve named it DiskScan.

The only other tools recommended for this so far were badblocks as a surface scan and fsck as a filesystem checker that will also read most of the disk surface, but not necessarily all of it. badblocks is limited to only telling you when a sector does not read at all.

DiskScan on the other hand will scan the entire surface, will alert when a sector is unreadable, and will also use the read timing to warn when a sector is having problems, as I believe that disk latency is what really matters in knowing whether a disk is good or bad.
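
This is not how DiskScan itself is implemented, but the core idea fits in a few lines: read the device sequentially with the page cache bypassed, time each read, and flag anything unreadable or unusually slow. A minimal sketch, assuming /dev/sdX and a 1 MiB chunk (both placeholders):

    import os, mmap, time

    DEV = "/dev/sdX"     # placeholder; reads only, but point it at the right disk
    CHUNK = 1 << 20      # 1 MiB per read

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache for honest timing
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, CHUNK)                      # page-aligned buffer, needed for O_DIRECT

    offset = 0
    while offset < size:
        start = time.monotonic()
        try:
            n = os.preadv(fd, [buf], offset)
        except OSError as e:
            print("unreadable around offset %d: %s" % (offset, e))
            n = CHUNK                               # skip past the bad area
        else:
            elapsed = time.monotonic() - start
            if elapsed > 1.0:                       # arbitrary threshold for a "slow" chunk
                print("slow read at offset %d: %.2fs" % (offset, elapsed))
        offset += n
    os.close(fd)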

This is just the bare start for this project; there is a lot to do here and I hope to also create a useful GUI to make it more accessible to those who do not swim in the CLI.

Other Unixes should work as well; the only pitfall I can think of is the block device ioctls that I’m using, which may not be portable, but it should be easy to find a working replacement for each of them.

Edit: Created a diskscan project page

Disk Unrecoverable Read Error Specification

In a recent entry on Hard Drives and UREs the author paints the right picture of how a disk may fail to read, but I feel that he misses the punchline. Admittedly, it took me a few days to bring it to the front of my mind as well.

All disks give a spec for their Unrecoverable Read Error rate; this is normally $10^{-15}$ for consumer drives and $10^{-16}$ for enterprise drives. Many take it to be the overall chance of getting a read error, and I’m pretty much convinced that this is wrong. It seems to me that this specification is more about the random chance for a disk to fail to read data that was supposed to be written to it beforehand. This includes many possible errors during write and during read: the head may not be able to lock onto the right place, the data may have been overwritten by a later error, or any of a dozen other possible failures. Many of these failures are tested for during read and write and there are definitely attempts to correct them; a normal SAS drive reports millions of minor corrective actions taken during its operation, most of them not worth noting.

The HDD mechatronics and the SSD physics are complex and hard to get right in all cases, and that’s where the URE spec comes from: these random failures to read data every now and then.

There is a whole other class of problems where the failure is of a larger scope: the head crashed into the platter, contamination from outside is wreaking havoc, the manufacturing process imbued some contamination or other defect, or some other external force is the source of the problem. These are not covered by the URE spec.

RAID Best Practice: Background Media Scan Monitoring

The BMS entry describes in detail what the Background Media Scan does; a corollary of that discussion is that a RAID device should monitor the BMS status of all of its SCSI disks (SAS & FC) and act on the status of that page:

  • If there is a recovered error it should be noted and, with some probability, that location should be scanned during the disk scrub to ensure the recovered data on the disk is correct.
  • If there is an unrecovered error the RAID should as soon as possible scrub the affected RAID stripe to recover the data for that location.
  • If a large number of BMS entries are generated the disk should be a candidate for replacement before real trouble happens.

The BMS log page is of a fixed size, and if entries are added to it faster than the monitoring picks them up it is possible for entries to be lost without being noticed. There are controls that may be used to stop the BMS from rolling over until the monitoring handles the new entries, but so far I haven’t seen a disk that implements these controls; the developer should be aware of this possibility and possibly alert the user that the disk is too troublesome to be maintainable.
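
For the monitoring itself, the page in question is the Background Scan results log page (15h), which sg_logs can read. A bare-bones polling sketch, assuming sg3_utils is installed and /dev/sg1 is the disk (the string match is only illustrative; a real monitor would parse the parameters properly and track which entries it has already seen):

    import subprocess, time

    DEV = "/dev/sg1"   # hypothetical device path

    def bms_page(dev):
        # Log page 0x15 is the Background Scan results page on SCSI disks.
        out = subprocess.run(["sg_logs", "--page=0x15", dev],
                             capture_output=True, text=True, check=True)
        return out.stdout

    while True:
        report = bms_page(DEV)
        if "unrecovered" in report.lower():    # crude trigger, for illustration only
            print("BMS reports an unrecovered error; scrub the affected stripe")
        time.sleep(3600)                       # poll often enough that entries are not lost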