Which are more reliable: hard disks or SSDs?

Friday, June 3, 2022 7:50 AM

Whatever their other differences, hard disks are cheaper than SSDs for any given capacity. The two main reasons often given for preferring SSDs whenever possible are that they’re faster and more reliable. This article considers the second of those claims.

Failure modes

Hard disks are elaborate electro-mechanical devices within which one or more platters coated with magnetic particles are spun at high speed. They’re written to and read by heads which are moved just above the surface of the platter. They can thus fail in many ways, including deterioration of the magnetic particles, physical contact between a head and platter (literally a crash between them), and failure of the electric motors that position the heads and spin the platters. These all deteriorate with use, resulting in local errors or bad blocks, and eventually complete failure of the disk.

As with most electro-mechanical devices, the failure rate of hard disks is generally thought to follow a U curve (often called a bathtub curve), with a high rate of failure in the first few weeks of use, declining to a low rate which is sustained until the disks start to wear out.
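The U curve described above can be sketched as the sum of three terms: an early-failure term that decays quickly, a constant low rate, and a wear-out term that rises with age. The function name and every parameter value below are illustrative assumptions chosen only to produce the shape, not measured figures for any disk.

```python
import math

def u_curve_failure_rate(age_years, early=0.05, early_decay=8.0,
                         baseline=0.01, wear_onset=4.0, wear_growth=0.02):
    """Illustrative failure rate (failures per disk-year) at a given age.

    All parameters are made-up values that simply give a U shape.
    """
    # Early failures: high at first, fading away within the first months
    infant = early * math.exp(-early_decay * age_years)
    # Wear-out: zero until wear_onset, then climbing with age
    wear = wear_growth * max(0.0, age_years - wear_onset) ** 2
    return infant + baseline + wear

for age in (0.0, 0.1, 1.0, 3.0, 5.0, 7.0):
    print(f"age {age:>4} years: {u_curve_failure_rate(age):.3f} failures per disk-year")
```

With these assumed values the rate starts high, settles at the baseline for several years, then climbs again once wear-out begins.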

Solid-state disks contain no moving parts, and fail differently from hard disks. They can develop bad blocks during use, and some studies have claimed that they do so at a rate as high as, or higher than, hard disks. However, the main cause of failure is thought to be their memory reaching the limit on the number of times it can be erased and written to again, its write endurance.

Unfortunately, different studies have resulted in conflicting claims. For example, it’s generally thought that each block has a finite write endurance, so the amount of data written to an SSD is a critical determinant of when it will fail. However, other studies claim that device age is more important than the amount of data written, which contradicts that.

SSDs normally follow a form of U curve, with high initial failure rates, followed by a long period in which failure is uncommon. If write endurance is the main determinant of failure, then those low rates should continue until their write endurance is reached, at which point failure becomes inevitable.

Thus failure modes and lifetimes for hard disks and SSDs are completely different. Wear and tear, the cause of most failure in hard disks, occurs over time, relatively independently of the amount of data written to the disk, but is adversely affected by events such as spinning up the platters. For SSDs, write endurance can be consumed rapidly when large quantities of data are written repeatedly to small SSDs, or very slowly when they are almost entirely used for reading data.
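As a rough illustration of how much use affects the time taken to consume write endurance, the sketch below divides a rated endurance by a daily volume of writes. The endurance rating (in terabytes written) and the daily volumes are hypothetical figures, not those of any particular SSD.

```python
def years_to_reach_endurance(rated_tb_written, gb_written_per_day):
    """Years until a rated write endurance (terabytes written) is used up."""
    if gb_written_per_day <= 0:
        raise ValueError("daily writes must be positive")
    tb_per_year = gb_written_per_day * 365 / 1000  # decimal units: 1 TB = 1000 GB
    return rated_tb_written / tb_per_year

# Hypothetical small SSD rated for 150 TB written
heavy = years_to_reach_endurance(150, 200)  # large quantities written every day
light = years_to_reach_endurance(150, 10)   # used almost entirely for reading
print(f"heavy writing: {heavy:.1f} years, light writing: {light:.1f} years")
```

On these assumed figures the same SSD would reach its rated endurance in about two years under heavy writing, but not for decades when mostly read.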

For both types of storage, early failures appear to differ little. Rates of failure during the bottom of the U curve, the working life, are normally low and also appear to be little different. Most important is the turning point at which the failure rate starts to climb rapidly with advancing age. Users should aim to replace their storage just before that point is reached.

Conditions of use

Hard disks have been used in widely differing conditions. Until ten years ago, most desktop Macs came fitted with an internal hard disk. Most users spun up that hard disk when they powered their Mac on each day, and many also put the hard disk to sleep at various times before shutting down. Apple switched internal storage in MacBook Pro notebooks to SSDs at about the same time. Prior to that, their hard disks were often spun up several times each day.

RAID and NAS systems are more likely to be left running for prolonged periods, spinning hard disks up far less frequently, but commonly reading and writing greater quantities of data. Compared with data centres, though, most working environments are less well controlled in temperature, long considered to be a significant determinant of hard disk life.

Studies

Storage manufacturers have some of the most extensive data on hard disk and SSD failure rates, but as a rule don’t publish it. However, they do publish claims for their products, often in terms of ‘mean time between failures’ (MTBF). Typical figures given by major manufacturers are of the order of a million hours, which could be interpreted as indicating that on average their products should last over a century. Despite those claims, most storage media are only given warranties of 2-3 years, and ‘enterprise’ quality storage with longer warranties is now unusual and costly.
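The arithmetic behind that interpretation is simple. The sketch below converts a quoted MTBF into years and, assuming a constant failure rate (an exponential model, which is an assumption made here rather than anything a manufacturer states), into the fraction of disks expected to fail in a year of continuous use.

```python
import math

HOURS_PER_YEAR = 24 * 365

def mtbf_in_years(mtbf_hours):
    """Quoted MTBF expressed in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR

def annual_failure_fraction(mtbf_hours):
    """Fraction expected to fail within one year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(f"{mtbf_in_years(1_000_000):.0f} years")               # 114 years
print(f"{annual_failure_fraction(1_000_000):.2%} per year")  # 0.87% per year
```

On that model, an MTBF of a million hours is better read as a little under 1% of disks failing each year than as any individual disk lasting over a century.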

Among the more useful studies are those produced by Backblaze, the cloud service provider, over the last seven years. Although those provide valuable insights into data centre use of specific makes and models of hard disks, the disks used and conditions of use are very different from the great majority of home and office users. While they say a great deal about the reliability and working life of the specific models they use, they can’t be used to draw more general conclusions about failure rates or expected working life for other applications or models.

Comparable large-scale studies of SSDs are generally older, smaller, and for different conditions of use. For example, the last survey from Backblaze covered only three years’ experience with SSDs, almost all of 500 GB or less, being used to boot and run their storage servers. Most of the SSDs used are from Seagate, and the number of failures reported is too low to provide any useful comparison with their data from hard disks.
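To see why a small number of failures makes comparison difficult, the sketch below calculates an annualised failure rate as failures per drive-year, together with an exact 95% confidence interval that treats the failure count as a Poisson variable. The failure counts and drive-days are hypothetical, chosen so both cases have the same rate; they are not taken from any published survey, and the Poisson model is an assumption.

```python
import math

def annualized_failure_rate(failures, drive_days):
    """Failures per drive-year of operation, as a percentage."""
    return failures / (drive_days / 365) * 100

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    term = total = math.exp(-mean)
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total

def solve_decreasing(func, lo, hi):
    """Bisection for a decreasing function that crosses zero between lo and hi."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def failure_count_interval(failures, alpha=0.05):
    """Exact confidence interval for the expected number of failures."""
    ceiling = failures + 10 * math.sqrt(failures + 1) + 10  # safely above the upper bound
    if failures == 0:
        lower = 0.0
    else:
        # Lower bound: the mean at which seeing this many or more has probability alpha/2
        lower = solve_decreasing(
            lambda m: poisson_cdf(failures - 1, m) - (1 - alpha / 2), 0.0, ceiling)
    # Upper bound: the mean at which seeing this many or fewer has probability alpha/2
    upper = solve_decreasing(
        lambda m: poisson_cdf(failures, m) - alpha / 2, 0.0, ceiling)
    return lower, upper

# Hypothetical fleets with the same observed rate but very different failure counts
for failures, drive_days in ((4, 400_000), (400, 40_000_000)):
    low, high = failure_count_interval(failures)
    print(f"{failures} failures: {annualized_failure_rate(failures, drive_days):.2f}% "
          f"(95% interval {annualized_failure_rate(low, drive_days):.2f}% to "
          f"{annualized_failure_rate(high, drive_days):.2f}%)")
```

With only four failures the interval spans almost a tenfold range, so the true rate could be far lower or far higher than the figure observed, whereas with four hundred it is quite narrow.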

Given the size and value of the storage market, it’s shocking how little information on failure rates is available.

Write amplification

Given the importance of write endurance in determining SSD working life, one factor which appears unexplored is the use of SLC write cache in most modern SSDs. Essentially, this uses some of the normal SSD blocks in SLC mode, which writes data at high speed and low density. Once that has been filled, the SSD then copies the written data in slower time to other blocks in the SSD at normal density. This effectively turns each write into a sequence of two writes: write amplification.
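That effect can be put into numbers using a write amplification factor, the ratio of data physically written to flash to data written by the host. The sketch below uses the simplified picture given above, in which anything passing through the SLC cache is written twice. The function name and the fraction of writes passing through the cache are hypothetical, and real SSD firmware is more complicated.

```python
def write_amplification_factor(host_gb, fraction_via_slc_cache=1.0):
    """Ratio of flash writes to host writes under a simple two-write model."""
    if host_gb <= 0:
        raise ValueError("host_gb must be positive")
    cached = host_gb * fraction_via_slc_cache
    direct = host_gb - cached
    # Cached data is written once in SLC mode, then again at normal density
    flash_gb = cached * 2 + direct
    return flash_gb / host_gb

print(write_amplification_factor(100))       # 2.0
print(write_amplification_factor(100, 0.5))  # 1.5
```

If only half the data written passed through the cache, the factor in this model would fall from 2 to 1.5.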

I’m not aware of any studies that have assessed the effects of write amplification on SSD working life. SMART figures for the total data written to an SSD generally count what the host has sent to it, so may not include those additional internal writes. Where an SSD reports its own wear indicator, such as the percentage of its rated endurance used, that should account for them, and is the better figure to monitor.

Conclusions

Although good estimates don’t exist, the failure rates of hard disks and SSDs during their expected working life appear low, and not significantly different. Greater differences are seen between different manufacturers, models and batches.

Hard disks will become significantly more likely to fail after three years, but some will retain low failure rates for seven years or more. It’s impossible to predict which will last reliably into old age.

In theory at least, SSDs should have a longer working life before failure unless they’re subjected to abnormally high write rates consuming their write endurance rapidly. Working life could extend as long as ten years, but that will vary according to model and use.