Sometimes you can find in your logs a line similar to this:
ata3.00: NCQ disabled due to excessive errors
and you are left wondering whether this is good, bad, or even warranted.
What is NCQ?
NCQ (Native Command Queuing) is the disk feature that lets the disk accept multiple commands in parallel, reorder them, and complete each IO not in the order it arrived but in the order the disk thinks is best from a performance perspective. In the old days a disk could service only one command at a time. In fact, even today, at the very lowest level the disk can only do one IO at a time, and a lot of work went into improving performance by handing the disk its requests in the optimal order. As technology advanced and complexity grew, operating systems and applications fell out of sync with the disks and can no longer do that job well. The task is given to the disk itself instead, since only the disk knows its own internal geometry and redirections well enough to do the best job. This is not without fault, but it is far simpler to adapt the disk to its new technologies than to adapt the myriad of applications.
The end result is that performance can be near optimal. The disk queue is fairly small, at only 31 requests for a SATA disk, so the OS still does its own reordering and controls the order in which IO requests are given to the disk, but the disk can still do the best job within its constraints.
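The queue depth currently in effect can be read from sysfs. The following is a minimal sketch, assuming the disks show up as sd* block devices; the function name and the SYSFS_ROOT variable are my own additions (the latter only exists so the function can be pointed at a test directory) and are not part of any standard tool.

```shell
# Print the current queue depth of every sd* disk.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
show_queue_depth() {
    root="${SYSFS_ROOT:-/sys}"
    for f in "$root"/block/sd*/device/queue_depth; do
        # Skip the unexpanded glob when no disk matches.
        [ -r "$f" ] || continue
        # Reduce ".../block/sda/device/queue_depth" to "sda".
        disk=${f#"$root"/block/}
        disk=${disk%%/*}
        printf '%s: %s\n' "$disk" "$(cat "$f")"
    done
}
# Usage: show_queue_depth
```

A value of 31 means the full queue is available to the disk.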
What can be the reason to disable it?
NCQ didn’t come without its own set of problems. Initially not all disks supported it well: some failed under certain conditions when used with NCQ, so workarounds were needed. Linux implemented a three-strikes policy: if the disk has three errors during the OS uptime, the kernel drops NCQ on the niggling suspicion that it may be the culprit. This is a harsh judgement these days, as most devices work very well with NCQ and only fringe devices fail with it.
One misleading signal that can trigger the NCQ disabled message is a disk timeout. If a disk is slightly misbehaving it may stall for far longer than the 30 seconds allotted to every IO by default, and each such failure counts towards the NCQ-disabling punishment. If the IO request is retried at the same location and times out again, NCQ gets disabled for no good reason, as NCQ is not to blame.
The underlying problem is that if there is a problem with the NCQ implementation on the drive, then it will manifest itself as an IO that never comes back, which shows up as a timeout, just as if the disk had a media problem that required many retries to succeed.
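To find out which devices have already been punished, the kernel log can be searched for the message quoted at the top. This is a rough sketch: the function name is made up for illustration, and it assumes the message has exactly the form shown in this post.

```shell
# List the ATA devices for which the kernel reported disabling NCQ.
# Reads kernel log text on stdin and prints each device once.
ncq_disabled_devices() {
    grep 'NCQ disabled due to excessive errors' |
        sed -n 's/.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): NCQ disabled.*/\1/p' |
        sort -u
}
# Usage: dmesg | ncq_disabled_devices
```

If the same device also shows timeout errors shortly before the message, the timeouts are the more likely culprit than NCQ itself.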
What can be the impact of this kernel action?
The main impact of disabling NCQ is performance. On my disks the throughput dropped from around 20MB/s to 3MB/s, which is an abysmal rate. It will be pretty painful to continue using the disk in this mode, and it is advisable to avoid this punishment if possible.
What to do to avoid this error?
The main cause I’m aware of is a media problem on the disk. A media problem can cause timeouts, because a SATA disk can take upwards of 30 seconds retrying a read before it gives up. In fact, some disks have an essentially infinite timeout, since their only consideration is to recover the data if at all possible: it is assumed that a SATA disk is used in a non-RAID configuration and that losing the data is far worse than waiting a very long time for it. Some disks have a TLER/ERC/CCTL option that limits these otherwise unbounded retries to a manageable value, and it’s worth checking whether your disk has it. To check the TLER setting on your drive:
smartctl -l scterc /dev/sda
To set the TLER limits (the two values are the read and write limits in tenths of a second, so 250 means 25 seconds):
smartctl -l scterc,250,250 /dev/sda
If you actually want to get the data back, and also in general, my recommendation is to increase the timeouts for the disks. This can be done by setting the timeout value in sysfs for each disk, and it is easiest to do with a command:
find /sys -path '*:*:*:*/timeout' | while read -r d; do echo 300 > "$d"; done
This will set the timeout to 5 minutes (300 seconds).
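It is worth reading the values back to confirm the change took effect. A small sketch along the same lines as the command above; the function name and the SYSFS_ROOT override are hypothetical and only there to make the snippet easy to try against a scratch directory.

```shell
# Print each SCSI device's command timeout (in seconds) next to its sysfs path.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
show_timeouts() {
    find "${SYSFS_ROOT:-/sys}" -path '*:*:*:*/timeout' 2>/dev/null |
        while read -r f; do
            printf '%s %s\n' "$(cat "$f")" "$f"
        done
}
# Usage: show_timeouts
```

Every line should start with 300 after the change. Note that the setting does not survive a reboot on its own.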
It is also possible to make this the default for any disk and at boot by adding a file to /etc/udev/rules.d with the content:
ACTION=="add", SUBSYSTEM=="scsi", DRIVER=="sd", PROGRAM="/bin/sh -c 'echo 300 > /sys/$devpath/timeout'"
What can be done immediately?
In the immediate term, if the disk had its NCQ disabled, you should be able to set the queue depth again. According to a quick read of the kernel sources, this should reset the NCQ_OFF flag and bring back the performance. It is suggested that you also increase the timeout as above to avoid this happening again.
find /sys -path '*:*:*:*/queue_depth' | while read -r d; do echo 31 > "$d"; done
We use 31 rather than 32 because libata limits the queue depth to 31.
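A slightly more careful variant of the same idea writes the depth and then reads it back, so you can see which devices accepted it. This is only a sketch: the function name and the SYSFS_ROOT override are invented for illustration, and whether the write really re-enables NCQ is an assumption based on the reading of the kernel sources mentioned above.

```shell
# Write a queue depth (default 31) to every matching device and read it back.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
set_queue_depth() {
    depth="${1:-31}"
    find "${SYSFS_ROOT:-/sys}" -path '*:*:*:*/queue_depth' 2>/dev/null |
        while read -r f; do
            if echo "$depth" > "$f" 2>/dev/null; then
                # Show the value the kernel actually kept.
                printf '%s -> %s\n' "$f" "$(cat "$f")"
            else
                printf '%s: write failed\n' "$f" >&2
            fi
        done
}
# Usage (as root): set_queue_depth 31
```

A device that reports a smaller number than the one written did not accept the full depth.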