Interpreting S.M.A.R.T. data from hard drives

Introduced in 1995, S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) is a feature built into IDE, SATA and SCSI hard drives as well as SSDs. It is a set of data that is accumulated and stored inside the drive to record its performance and history. As problems or events occur inside the drive, they are recorded as SMART data in a reserved area on the drive.

SMART data can be displayed using many different free software utilities. No version of Windows includes a built-in utility for displaying detailed SMART data. Recommended free utilities, easily found with a web search, include SpeedFan and HD Tune.

Most utilities offer a choice between raw data, typically in hexadecimal (base 16) format, and regular decimal (base 10) format. Raw hex is useful for spotting when a single 8-digit SMART field is being used for split high/low counters, where the first four digits represent one attribute and the last four represent either a separate attribute or a low threshold. This can help properly interpret decimal numbers that are unusually high.

For example, some drives report temperature as a range rather than a current value, but when the double-word hex is converted to decimal, it becomes a single large number that is meaningless.
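As a rough illustration of the split described above, the sketch below breaks an 8-digit hex field into its first four and last four digits. The function name and the sample value 0x002D0019 are made up for this example, and real drives may pack their fields differently, so treat it as a starting point rather than a decoder for any particular drive.

```python
from typing import Tuple


def split_raw_field(raw: int) -> Tuple[int, int]:
    """Split an 8-hex-digit (32-bit) raw value into its high and low halves."""
    if not 0 <= raw <= 0xFFFFFFFF:
        raise ValueError("expected a value that fits in 8 hex digits")
    high = (raw >> 16) & 0xFFFF  # first four hex digits
    low = raw & 0xFFFF           # last four hex digits
    return high, low


# Hypothetical raw field, not taken from a real drive.
raw = 0x002D0019
high, low = split_raw_field(raw)
print(f"as one decimal number: {raw}")      # 2949145, which means nothing
print(f"as hex: {raw:08X}")                 # 002D0019
print(f"high half: {high}, low half: {low}")  # 45 and 25
```

Read as a single decimal number the field is 2949145, but split in half it becomes 45 and 25, which look far more like a pair of temperature readings.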

Every drive manufacturer uses a slightly different set of SMART data and descriptions. These variations create a challenge when interpreting SMART data. For example, most hard drives report power-on time in hours, but some drives report it in minutes and a few use seconds.
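To compare drives that count power-on time differently, the value first has to be converted to a common unit. The sketch below assumes you already know which unit a given drive uses (SMART itself does not say), and the function name and unit labels are our own.

```python
# Divisors that turn each reporting unit into hours.
UNIT_DIVISORS = {"hours": 1, "minutes": 60, "seconds": 3600}


def power_on_hours(raw_value: int, unit: str) -> float:
    """Convert a raw power-on counter to hours, given the unit the drive uses."""
    if unit not in UNIT_DIVISORS:
        raise ValueError(f"unknown unit: {unit!r}")
    return raw_value / UNIT_DIVISORS[unit]


# Three hypothetical drives that have all been powered on for the same time.
print(power_on_hours(30_000, "hours"))         # 30000.0
print(power_on_hours(1_800_000, "minutes"))    # 30000.0
print(power_on_hours(108_000_000, "seconds"))  # 30000.0
```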

Many computers will perform a startup check of the SMART history on a hard drive, comparing the current SMART data against a set of pre-defined thresholds. If any attribute exceeds a defined threshold, a warning is displayed.

For example, on a Seagate 40 GB IDE hard drive, the relocated sector threshold is 50, so when there are more than 50 relocated sectors, the computer will display a SMART warning on startup. While the drive will continue to operate, the user should arrange to copy all files and replace the failing hard drive.
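The startup check can be pictured as a simple comparison of each attribute against its threshold. This is a simplified model that follows the description above; the attribute names are placeholders, and actual firmware often compares a vendor-normalized value rather than the raw count.

```python
def smart_warnings(current: dict, thresholds: dict) -> list:
    """Return the names of attributes whose current count exceeds its threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if current.get(name, 0) > limit
    ]


# Threshold from the 40 GB Seagate example; the attribute name is our own label.
thresholds = {"relocated_sectors": 50}

print(smart_warnings({"relocated_sectors": 12}, thresholds))  # []
print(smart_warnings({"relocated_sectors": 51}, thresholds))  # ['relocated_sectors']
```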

Power-on time is the most accurate measure of the age of a hard drive, and can be useful for determining when a drive should be replaced. Most hard drives are considered to be at a higher risk of complete failure as they approach 50,000 power-on hours. This is equivalent to roughly 24 years of 9-to-5 weekday usage (about 2,080 hours per year), or over 5 years of continuous 24×7 operation.
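The arithmetic behind those figures is easy to check. The sketch below assumes an 8-hour day, 5 days a week, 52 weeks a year for office use, and 24 hours a day, 365 days a year for continuous use.

```python
RISK_HOURS = 50_000

OFFICE_HOURS_PER_YEAR = 8 * 5 * 52     # 2,080 hours: 9-to-5, weekdays only
CONTINUOUS_HOURS_PER_YEAR = 24 * 365   # 8,760 hours: always on


def years_to_reach(hours: float, hours_per_year: float) -> float:
    """How many years of use it takes to accumulate the given power-on hours."""
    return hours / hours_per_year


print(round(years_to_reach(RISK_HOURS, OFFICE_HOURS_PER_YEAR), 1))      # 24.0
print(round(years_to_reach(RISK_HOURS, CONTINUOUS_HOURS_PER_YEAR), 1))  # 5.7
```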

The next most useful data attribute is relocated sectors. This counter indicates how many sectors were found unusable and relocated. Every hard drive has a fixed number of spare sectors, typically between 50 and 100. This allows the drive to tolerate a small number of bad sectors without reducing the total capacity.

Once all of the spare sectors have been used, the drive will allocate more sectors to replace bad sectors, but the total capacity of the drive will be reduced.

The re-allocation process occurs when the drive is unable to save data successfully into a sector. When a bad sector is discovered during a write, it is marked unusable and the data is saved into a spare sector.

Whenever bad sectors are present on a hard drive, the best practice is to perform a full read-write test across the entire hard drive to discover any new bad sectors. Hard drives with bad sectors should always be tested twice and checked to see if the bad sector count increases. If the count increases after the second complete test, the drive is unreliable and failing and should be replaced.
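The two-pass rule can be written down in a few lines. The function name and return strings are ours; the counts would be read by hand from your SMART utility after each complete test.

```python
def verdict_after_retest(count_after_first_test: int,
                         count_after_second_test: int) -> str:
    """Apply the two-pass rule: a count that grows on the second pass means replace."""
    if count_after_second_test > count_after_first_test:
        return "replace"
    return "keep monitoring"


print(verdict_after_retest(8, 8))   # keep monitoring
print(verdict_after_retest(8, 11))  # replace
```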

Occasionally on Western Digital drives with relocated sectors, we see these numbers return to zero after performing a full read-write test of the entire hard drive. This occurs when a previously relocated sector is re-tested after re-writing the entire 512-byte sector with a data pattern. The pattern may force the drive to re-establish the data on the drive correctly, and if it succeeds in writing and reading the sector, it will be returned to use as a good sector.

Seek and ECC errors are a less useful SMART statistic because of the differences between manufacturers. For example, every Seagate hard drive will display unusually high seek and ECC errors, typically 100 million or more.

In fact, after testing over 1,000 different Seagate hard drives of all sizes, we have never found a Seagate hard drive that did not have very high seek and ECC errors. While Seagate Technology does not offer an explanation for their high seek and ECC errors, it's likely that they are transparently reporting the results of data being read before PRML (partial response maximum likelihood) is applied.

PRML is a method for statistical analysis of the data signal returned by the drive as it is converted back into the binary data stream. If our guess is correct, Seagate's ECC and seek numbers would be lower and more meaningful if they were reported after PRML is applied.

By contrast, Western Digital hard drives rarely report non-zero seek and ECC errors, and when they do, the drive is already failing and reporting other errors.

Hard drive temperature is another value reported by SMART on every drive. Typically, we consider 50 °C to be the upper limit for operating temperature. It is unusual and problematic if a drive reports a temperature over 50 °C, indicating either poor cooling or an overheating drive that may have a failing motor or circuit board.

Some manufacturers, including Seagate and Western Digital, report uncorrectable sectors separately from relocated sectors. While a relocated sector indicates a bad sector that has been replaced by a spare, an uncorrectable sector is one that has no spare sector. Uncorrectable sectors are problematic, and should be considered severe enough to warrant replacement of a drive.

One of the limitations of SMART data is that it is based only on the drive sectors that have been read and written. On a typical hard drive, not all space gets read or written. Often, there is a significant amount of unused empty space that may contain bad sectors but is untested. These bad sectors do not become apparent until the drive eventually uses the space, and then fails when attempting to use a sector for the first time.

The solution to the partial use problem is to perform annual full read/write testing on a hard drive. While read-only testing the entire drive will discover sectors that are unreadable, it is possible to pass a read test but fail a write test.

Aside from SMART data, we also recommend listening closely to the sound of the hard drive. An audible high-pitched whine from the drive motor is a sign of wear that will lead to failure. This is commonly heard from hard drives that are 40 GB or smaller, since they use ball bearings inside the motors. Larger hard drives use silent fluid dynamic bearings.

Another sign of hard drive problems is revealed when performing a full read test. A properly working drive should advance rapidly and smoothly through a read test without delays. A failing drive will frequently pause or stutter as the drive relies on repeat reads or error correction to properly read the data.
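If your test tool logs how long each block took to read, the pauses can be picked out automatically. The sketch below is one possible approach, not a feature of any particular utility: it flags blocks that took more than ten times the median read time. The factor of ten and the sample timings are arbitrary choices made for illustration.

```python
from statistics import median


def find_stalls(read_times_ms: list, factor: float = 10.0) -> list:
    """Return the indexes of blocks that took more than `factor` times the median."""
    if not read_times_ms:
        return []
    typical = median(read_times_ms)
    return [i for i, t in enumerate(read_times_ms) if t > factor * typical]


# Made-up per-block read times in milliseconds; two blocks needed repeat reads.
timings = [4.1, 3.9, 4.0, 4.2, 950.0, 4.0, 3.8, 410.0, 4.1]
print(find_stalls(timings))  # [4, 7]
```

Using the median rather than the average keeps a handful of very slow blocks from hiding themselves by dragging the baseline up.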

Another limitation of SMART data is that drives can develop problems that are not counted by the SMART data. For example, if the fluid inside the motor bearing is lost, the disk platter will settle and grind against the disk read-write head. A disk shift value that would reveal this is reported only on Hitachi hard drives, so a Seagate or Western Digital drive with a failing fluid dynamic bearing will not provide any failure warning.

In conclusion, a healthy hard drive should be checked annually, and should run cool and quiet with fewer than 40,000 power-on hours and no relocated sectors.
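The rules of thumb from this article can be gathered into a single checklist. The sketch below uses the limits given above (40,000 hours, 50 °C, zero relocated or uncorrectable sectors); the class and field names are our own, and the values would come from whatever SMART utility you use. It cannot check for noise or stuttering reads.

```python
from dataclasses import dataclass


@dataclass
class DriveSnapshot:
    """Values read by hand from a SMART utility; field names are hypothetical."""
    power_on_hours: float
    relocated_sectors: int
    uncorrectable_sectors: int
    temperature_c: float


def health_concerns(drive: DriveSnapshot) -> list:
    """Return a list of reasons the drive falls short of the article's guidelines."""
    concerns = []
    if drive.power_on_hours >= 40_000:
        concerns.append("40,000 or more power-on hours")
    if drive.relocated_sectors > 0:
        concerns.append("relocated sectors present")
    if drive.uncorrectable_sectors > 0:
        concerns.append("uncorrectable sectors present")
    if drive.temperature_c > 50:
        concerns.append("temperature over 50 C")
    return concerns


healthy = DriveSnapshot(12_500, 0, 0, 38)
worn = DriveSnapshot(46_000, 7, 0, 53)
print(health_concerns(healthy))  # []
print(health_concerns(worn))
```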
