The following is a list of questions I have in mind when considering an SSD. Performance is important, and there are good and bad SSDs in that regard, but the part that worries me most is the behind-the-scenes RAS (Reliability, Availability, Serviceability) issues.
- How fast is it in read and write?
- Which internal data-path design details enable getting the maximal parallelism from the disk?
- How many IOs in parallel can be maintained? (queue depth)
- What is the smallest optimal block size and what is the alignment? Where does the limitation come from?
- What is the optimal TRIM usage?
- What IO request size is atomic?
- What is the variability in response times? What are the causes for the variability?
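For the response-time variability question, a mean alone hides the tail, so it helps to look at percentiles. The following is a minimal sketch, assuming latency samples (in microseconds) were already collected by some benchmark tool; the function names and the sample values are hypothetical and only illustrate the idea.

```python
def percentile(ordered, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer in 1..100."""
    if not ordered:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_latencies(samples_us):
    """Summarize latency samples (microseconds) so the tail is visible."""
    ordered = sorted(samples_us)
    return {
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Hypothetical data: 98 fast IOs and 2 IOs delayed by something like
# an error recovery or a background activity.
samples = [100] * 98 + [50_000] * 2
summary = summarize_latencies(samples)
print(summary)
```

With data shaped like this, the median stays at the fast value while the 99th percentile jumps to the slow one, which is the kind of gap the question about causes of variability is meant to explain.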
Some of the RAS questions are on the border of the data-path but they take care of things that application developers tend to ignore.
- What errors might be returned? In what conditions? What needs to be done for each error?
- What are other hard failure modes that are not covered by simple error sense codes? Drive failing to start? Drive asserts?
- What is the error recovery procedure (ERP) of the disk?
- What is the smallest unit of work for ERP?
- How long can the ERP take?
- What is the expected distribution for ERP times?
- What is the impact of an ERP action on other IOs? (delay until all ERP is finished? prevent parallelism?)
- What happens on interface errors (initiator disappeared)?
- How long can the capacitors keep the SSD powered after power is lost?
- How long does it take for them to get charged enough?
- How is it possible to know that a hold-up is due to the capacitors charging?
- How and when are the capacitors tested during operation? How do we know whether they passed or failed the test?
- How is the NAND evaluated during operation? When does a NAND block get retired?
- Is there an internal RAID structure to recover from a dead NAND block? What will be the ERP impact of such a recovery?
- What is the SMART logic based on? When will a SMART trip be triggered?
- Is the SMART trip fatal or are there warning trips as well?
- How often should the SMART information be polled? Can the disk send notifications about a change in SMART?
- What is the impact of a SMART request or a LOG SENSE request? Anything that might take longer? Prevent normal data-path accesses in parallel?
- What information should be collected when a failure is detected? Are there vendor-specific logs that may help debug the issue?
- What sort of recovery will happen if power is lost unexpectedly? How long will it take? How will we know that it is taking place? And what will happen if it fails?
- What sort of background activities happen on the disk?
- What is the impact of the background activities on latency? Are they stopped when a user IO is received or will they delay the user IO?
- How can the background activities be monitored?
- How can I inject errors into the drive to test the whole hardware and software stack? It should be possible to inject media errors, hardware errors and latency, as well as a mix of them (media errors with latency).
- What is the data protection inside the device? This is needed to avoid corruption of data inside the disk.
- What are the external abilities for data protection (T10-DIF? 520-byte sectors?)
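To make the last question concrete: with T10-DIF, each 512-byte sector carries 8 extra bytes of protection information (a 2-byte guard CRC, a 2-byte application tag and a 4-byte reference tag), which is where the 520-byte sector comes from. Below is a rough sketch of that layout, assuming Type 1 protection where the reference tag is the low 32 bits of the LBA. The helper names and the application tag value are hypothetical, and a real stack would let the HBA or the drive generate and check these fields.

```python
import struct

T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial used for the T10-DIF guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with the T10-DIF polynomial, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect_sector(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append guard, application and reference tags to a 512-byte sector."""
    if len(data) != 512:
        raise ValueError("expected a 512-byte sector")
    # All three fields are stored big-endian after the data.
    return data + struct.pack(">HHI", crc16_t10dif(data), app_tag, lba & 0xFFFFFFFF)


def verify_sector(sector: bytes, lba: int) -> bool:
    """Check the guard CRC and the reference tag of a 520-byte sector."""
    if len(sector) != 520:
        raise ValueError("expected a 520-byte sector")
    guard, _app_tag, ref_tag = struct.unpack(">HHI", sector[512:])
    return guard == crc16_t10dif(sector[:512]) and ref_tag == (lba & 0xFFFFFFFF)


payload = bytes(range(256)) * 2          # hypothetical 512 bytes of data
sector = protect_sector(payload, lba=1234)
corrupted = bytes([sector[0] ^ 0x01]) + sector[1:]
print(verify_sector(sector, 1234), verify_sector(corrupted, 1234), verify_sector(sector, 1235))
```

The guard tag catches corruption of the data itself, while the reference tag catches a sector that was written to or read from the wrong LBA, which a plain checksum over the data would miss.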