Disk Survey

Surveying the disks of the world

Disk Error Recovery: Attempting Task Abort

When a command is sent to a SCSI device (an HDD, an SSD or even a SAN), the host also sets a timeout. The customary value is 30 seconds, and that is likely what you are using unless you changed it or your application does something unusual. The host tracks each command it sends and runs a timer from the moment the command goes out until the device returns it. If the command takes longer than the timeout, the host must perform error recovery.
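As a rough illustration, and assuming a Linux host (the text above does not name an operating system), the per-device command timeout is exposed in seconds under sysfs at `/sys/block/<disk>/device/timeout`. The sketch below reads that value and shows the timer check the host effectively performs. The function names and the `sysfs_root` parameter are made up for this example.

```python
from pathlib import Path

# The customary default described above, in seconds.
DEFAULT_TIMEOUT_S = 30


def command_timeout_path(disk: str, sysfs_root: str = "/sys/block") -> Path:
    """Build the sysfs path holding the SCSI command timeout for a disk (Linux)."""
    return Path(sysfs_root) / disk / "device" / "timeout"


def read_command_timeout(disk: str, sysfs_root: str = "/sys/block") -> int:
    """Return the command timeout in seconds as currently configured."""
    return int(command_timeout_path(disk, sysfs_root).read_text().strip())


def has_timed_out(sent_at: float, now: float,
                  timeout_s: float = DEFAULT_TIMEOUT_S) -> bool:
    """True once a command has been outstanding for longer than the timeout."""
    return (now - sent_at) > timeout_s
```

For example, `read_command_timeout("sda")` would report the value for `sda` if that disk exists on the machine; the disk name is only a placeholder.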

Unfortunately, the host has no way to know whether the command was lost in transit and never seen by the device, whether the device has failed completely and will never reply, or whether the device is simply taking a very long time to handle the command. It is also possible that an earlier command is taking a long time and the command we are trying to decide about is still waiting its turn to be executed.

At this point the host has only a few options, and all of them are hammers: the first is small, the rest are heavy. The options are:

  • Task Abort
  • LUN Reset
  • Target Reset
  • Host Reset

The surgical hammer is the task abort. In my experience it either does the job well enough, or the escalation quickly reaches the heavier hammers without really resolving the issue. That is why I always try to configure things so that the heavier hammers are never reached.
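The escalation through the four options can be sketched as a simple ladder: try the smallest hammer first and move to the next only when the current one fails. This is a toy model, not any host stack's actual code. The step names and the `handlers` mapping are hypothetical, and real stacks may insert further steps between these.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Escalation order, taken from the list of options above.
ESCALATION_ORDER = ["task_abort", "lun_reset", "target_reset", "host_reset"]


def recover(handlers: Dict[str, Callable[[], bool]]) -> Tuple[Optional[str], List[str]]:
    """Try each recovery step in order until one reports success.

    Returns the step that succeeded (or None) and the list of steps tried.
    """
    tried: List[str] = []
    for step in ESCALATION_ORDER:
        tried.append(step)
        if handlers[step]():
            return step, tried
    return None, tried


# Hypothetical outcome: the abort fails, so the host escalates to a LUN reset.
outcome = recover({
    "task_abort": lambda: False,
    "lun_reset": lambda: True,
    "target_reset": lambda: True,
    "host_reset": lambda: True,
})
print(outcome)
```

Every step after the first affects more than the one stuck command, which is why stopping at `task_abort` is the preferred outcome.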

Things to know about the task abort and its timeout:

  • Devices do not really abort anything unless the command is still pending in the queue
  • If the command is already executing, the task abort only waits for it to complete and the result is not returned
  • For a read, the device may have exit points where it can cancel the command mid-work instead of waiting forever
  • The task abort timeout needs to be set very high to avoid reaching for the heavier hammers
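The abort behaviour in the list above can be made concrete with a toy model of a device queue. It is only a sketch of the described semantics, not how any firmware is written. The class, the tags and the timings are all invented for illustration.

```python
from typing import List, Optional, Tuple


class ToyDevice:
    """Toy device: one executing command plus a queue of pending ones."""

    def __init__(self) -> None:
        self.executing: Optional[Tuple[str, float]] = None  # (tag, seconds left)
        self.pending: List[Tuple[str, float]] = []

    def submit(self, tag: str, seconds: float) -> None:
        """Start the command if the device is idle, otherwise queue it."""
        if self.executing is None:
            self.executing = (tag, seconds)
        else:
            self.pending.append((tag, seconds))

    def abort(self, tag: str) -> float:
        """Abort a command; return how long the host waits for the abort."""
        for i, (pending_tag, _) in enumerate(self.pending):
            if pending_tag == tag:
                del self.pending[i]  # still queued: dropped immediately
                return 0.0
        if self.executing is not None and self.executing[0] == tag:
            wait = self.executing[1]  # already running: wait for it to finish
            # The result is discarded and the next queued command starts.
            self.executing = self.pending.pop(0) if self.pending else None
            return wait
        raise KeyError(tag)


def abort_completes_in_time(wait_s: float, abort_timeout_s: float) -> bool:
    """True if the abort returns before the abort timeout expires."""
    return wait_s <= abort_timeout_s
```

With a hypothetical command that still needs 45 seconds, an abort timeout of 10 seconds expires and the host escalates, while a timeout of 60 seconds lets the abort finish. That is the reason for setting the task abort timeout high.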

If your device shows task aborts, here is what to look for:

  • Link errors (sg_logs, sg_ses, counters on the host)
  • Medium errors (disk scan), using a single I/O at a time to prevent queueing trains
  • Firmware problem (consult the vendor if the above didn't help)
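A single-I/O scan of the kind mentioned for medium errors can be sketched as follows: read sequentially with exactly one read outstanding, and note any read that fails or is unusually slow. This is a minimal sketch with assumptions. The function name, the thresholds and the error cap are invented, and it uses ordinary buffered reads, so on a real disk the page cache can hide latency. A proper scan would use direct I/O or a dedicated tool.

```python
import os
import time
from typing import List, Tuple


def scan_single_io(path: str, chunk_size: int = 1024 * 1024,
                   slow_threshold_s: float = 1.0, max_errors: int = 16
                   ) -> Tuple[int, List[Tuple[int, float]], List[Tuple[int, int]]]:
    """Read `path` sequentially, one read at a time.

    Returns (bytes_read, slow_reads, errors) where slow_reads holds
    (offset, seconds) and errors holds (offset, errno).
    """
    slow: List[Tuple[int, float]] = []
    errors: List[Tuple[int, int]] = []
    offset = 0
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(errors) < max_errors:
            start = time.monotonic()
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError as exc:
                # Record the failing offset and skip past this chunk.
                errors.append((offset, exc.errno or 0))
                offset += chunk_size
                continue
            elapsed = time.monotonic() - start
            if not data:
                break  # end of file or device
            if elapsed > slow_threshold_s:
                slow.append((offset, elapsed))
            offset += len(data)
            bytes_read += len(data)
    finally:
        os.close(fd)
    return bytes_read, slow, errors
```

Because only one read is in flight, a slow or failing region shows up at its own offset instead of delaying a train of queued commands behind it.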