Modern Enterprise HDDs are designed for operating temperatures between 5 and 60 degrees Celsius. Manufacturers recommend that they should not be operated at the upper end of this range on a permanent basis because doing so will reduce the lifetime of the drives and pose a risk of higher failure rates. So what happens to HDDs at high temperatures? And can these effects be compensated for later by operating at lower temperatures?
Like most components in servers and storage systems, hard drives get warmer in operation, especially under heavy load. To enable administrators to monitor the temperature of their drives, modern HDDs have an internal temperature sensor that delivers its readings via SMART (Self-Monitoring, Analysis and Reporting Technology), so that they can be read using on-board operating system resources, system management tools, or the tools for managing RAID controllers and host bus adapters. In addition, there is a whole host of specialist tools for this task, such as the open-source smartmontools, available for both Windows and Linux.
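As a rough illustration of reading the sensor with smartmontools, the following Python sketch calls smartctl and extracts the temperature from its JSON output. It assumes a smartctl version with JSON support (the --json option) that is available on the system path, and it assumes the reading appears under the "temperature" and "current" keys; the function names and the sample data are made up for this example, and the fields actually reported can differ by drive and interface.

```python
import json
import subprocess


def parse_temperature(smartctl_json: str):
    """Return the current drive temperature in degrees Celsius, or None."""
    data = json.loads(smartctl_json)
    # Assumption: the reading is reported as {"temperature": {"current": N}}.
    temperature = data.get("temperature")
    if isinstance(temperature, dict):
        return temperature.get("current")
    return None


def read_temperature(device: str):
    """Run smartctl for one device and return its temperature, or None."""
    # Assumption: smartctl is on the PATH and the caller has enough privileges.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl also uses nonzero exit codes for warnings
    )
    if not result.stdout.strip():
        return None
    return parse_temperature(result.stdout)


# Made-up sample output, trimmed to the one field used here.
sample = '{"temperature": {"current": 38}}'
print(parse_temperature(sample))
```

On a real system, read_temperature would be called with a device path such as "/dev/sda", typically with administrator rights.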
If hard disks get too hot, they no longer work correctly because the electronic and mechanical components only function correctly within a certain temperature range. On top of this, the mechanical components wear out more quickly, resulting in reduced reliability and service life. In particular, the bearing of the spindle within the hard drives is at risk, because at high temperatures the oil used as lubricant becomes too runny and can leak out of the bearing. It is therefore essential that the temperature of the hard drive is monitored to prevent overheating and ensure that the drives provide long and reliable service.
What is the optimum temperature?
The manufacturers of hard disk drives usually specify a temperature range in which their drives operate correctly. In the case of Enterprise HDDs, they assume use in air-conditioned server rooms or data centers, which is why these types of drive are designed for operating temperatures between 5 and 60 degrees Celsius. The specifications for NAS HDDs are 5 to 65 degrees Celsius and those for surveillance HDDs are 0 to 70 degrees Celsius, because systems for video surveillance are not always set up in rooms with stable ambient conditions.
These specifications are really only about the operating capability, but durability is definitely adversely affected when drives are operated in the upper temperature range for a longer period of time. A brief temperature increase, for example, when a fan in the system has failed and must be replaced, can usually be tolerated, but even permanent operation at 45 degrees Celsius can cost the hard drives a few months of lifetime. After all, the Mean Time To Failure (MTTF) specifications in the manufacturers' data sheets always refer to an average operating temperature of 40 degrees Celsius.
An interesting point in this regard: average actually means that operating times at more than 40 degrees Celsius can later be compensated for by operating for a time at a correspondingly lower temperature. In practice, however, it is highly unlikely that HDDs first spend months or years at high temperatures and then the same amount of time at lower ones.
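To make the idea of an average operating temperature concrete, here is a minimal Python sketch that computes a time-weighted mean from periods spent at different temperatures. That the data sheet average is a simple time-weighted mean is an assumption made for illustration, and the function name and the sample history are hypothetical.

```python
def time_weighted_average(periods):
    """Average temperature over (hours, degrees Celsius) periods."""
    total_hours = sum(hours for hours, _ in periods)
    if total_hours <= 0:
        raise ValueError("total operating time must be positive")
    return sum(hours * temp for hours, temp in periods) / total_hours


# Hypothetical year: six months at 45 degrees, then six months at 35 degrees.
history = [(4380, 45.0), (4380, 35.0)]
print(time_weighted_average(history))
```

In this made-up case the hot and cool halves of the year cancel out to an average of 40 degrees Celsius, which is the kind of compensation described above.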
Temperature increases, reliability falls
A typical Enterprise HDD has an MTTF of two and a half million hours. In other words, in a population of two and a half million drives, one failure per hour would be expected, or in a population of 1,000 drives, one failure every 2,500 hours. Since this information is not particularly intuitive for estimating the failure probability of hard disks within one's own infrastructure, the annual failure rate (AFR) is usually used, which can be calculated from the MTTF. The formula for this is as follows: AFR = (1 - e^(-8,760/MTTF)) * 100, where 8,760 is the number of annual operating hours in the 24/7 operation which is standard for Enterprise HDDs.
In this formula, the drives that have already failed are taken into account when calculating the AFR for the remaining drives. However, this is not necessary for low failure rates, as is the case with hard disks, which means that the formula can be simplified: AFR = (8,760/MTTF) * 100. The resultant AFR for Enterprise HDDs with an MTTF of 2.5 million hours is therefore 0.35 percent. Where 1,000 drives are used, three to four of them can be expected to fail each year.
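The two AFR formulas can be compared with a few lines of Python. This is only a sketch of the arithmetic described above; the function and constant names are chosen for this example.

```python
import math

HOURS_PER_YEAR = 8760  # 24/7 operation


def afr_exact(mttf_hours):
    """AFR in percent, accounting for drives that have already failed."""
    return (1 - math.exp(-HOURS_PER_YEAR / mttf_hours)) * 100


def afr_simple(mttf_hours):
    """Simplified AFR in percent, adequate for low failure rates."""
    return HOURS_PER_YEAR / mttf_hours * 100


mttf = 2_500_000
drives = 1000
print(f"exact AFR:  {afr_exact(mttf):.4f} %")
print(f"simple AFR: {afr_simple(mttf):.4f} %")
print(f"expected failures per year: {drives * afr_simple(mttf) / 100:.1f}")
```

For an MTTF of 2.5 million hours, both variants round to 0.35 percent, which is why the simplified formula is sufficient here.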
If the average operating temperature of the hard drives is above 40 degrees Celsius, the failure rate increases. As a rule of thumb, for every 5 degrees above 40 degrees Celsius, the failure rate can increase by 30 percent. At a permanent HDD temperature of 55 degrees Celsius, the AFR should roughly double, so an installed base of 1,000 drives would probably see six to eight HDD failures per year.
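The rule of thumb can be turned into a small estimate. The Python sketch below makes several assumptions that the rule itself does not spell out: the 30 percent increase compounds per 5-degree step, temperatures between steps are interpolated, and temperatures below 40 degrees Celsius are treated as giving no reduction. All names and default values are illustrative.

```python
def temperature_factor(avg_temp_c, reference_c=40.0, step_c=5.0,
                       increase_per_step=0.30):
    """Multiplier on the failure rate for a given average temperature."""
    # Assumption: no benefit is modelled below the reference temperature.
    steps = max(0.0, (avg_temp_c - reference_c) / step_c)
    # Assumption: the increase compounds with every step.
    return (1 + increase_per_step) ** steps


base_afr_percent = 0.35  # AFR at an average of 40 degrees Celsius
drives = 1000

for temp in (40, 45, 50, 55):
    afr = base_afr_percent * temperature_factor(temp)
    print(f"{temp} C: AFR {afr:.2f} %, about {drives * afr / 100:.1f} failures per year")
```

Under these assumptions, the multiplier at 55 degrees Celsius is 1.3 to the power of 3, or roughly 2.2, which matches the statement that the AFR roughly doubles.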
Temperature is not the only factor
In addition to temperature, other factors affect the durability of hard drives, including annual workload (Rated Workload), warranty period, and, in the case of drives not designed for 24/7 use, operating time. This does not mean there is an immediate risk of failure if the specified values are not observed, or if the HDD continues to be operated after the warranty period has expired, but the AFR increases, so that over time more HDDs fail per year than would otherwise be expected.
Correct thermal design and cooling
In systems which are thermally well-designed and which are accommodated in air-conditioned rooms, there should normally be no problem keeping the hard drive temperature at 40 degrees Celsius or lower. Without air conditioning, this can be difficult because, in the summer months, the temperature in rooms often exceeds 30 degrees Celsius. This means that inside servers and storage systems, temperatures above 40 degrees Celsius are quickly reached. In addition, the warm exhaust air from the systems is difficult to remove without suitable ventilation, so the room temperature inevitably rises and the systems heat up even more.
It is therefore always better to operate server and storage systems in an air-conditioned environment - especially if top loaders with several dozen HDDs are used. For design reasons, the rear hard drives become warmer than the front ones, because the air flow absorbs the heat from the front drives first and is therefore no longer capable of cooling the rear ones quite as effectively. In this case, air intake temperatures of less than 20 degrees Celsius are required to keep the HDDs in the rear rows below 40 degrees Celsius on a permanent basis.
If the hard drive temperature is permanently more than 15 degrees Celsius above the air intake or ambient temperature, there is something amiss with the thermal design of the system. In this case, administrators need to check whether the fans are working correctly and whether the air flow reaches the drives without hindrance. In addition, the room as a whole needs to be designed so that cold and warm air do not mix, because this reduces cooling efficiency. This is why racks are usually positioned opposite each other. The cooling air is supplied in the middle, where it meets the front of the units and is drawn in to cool the system components. It absorbs heat in the process and then comes out again at the back of the units, where it is removed by fans. Covers on empty trays prevent the warm exhaust air from flowing back into the cold aisle.
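The 15-degree guideline lends itself to a simple automated check. The following Python sketch flags drives whose temperature exceeds the air intake temperature by more than a threshold. The drive names and readings are made up, and because the guideline refers to a permanent difference, a single snapshot like this would in practice need to be repeated over time before drawing conclusions.

```python
def drives_running_hot(drive_temps_c, intake_temp_c, max_delta_c=15.0):
    """Return {drive: delta} for drives more than max_delta_c above intake."""
    return {
        name: temp - intake_temp_c
        for name, temp in drive_temps_c.items()
        if temp - intake_temp_c > max_delta_c
    }


# Hypothetical readings in degrees Celsius from one chassis.
readings = {"front-01": 31.0, "front-02": 33.0, "rear-01": 39.0, "rear-02": 41.0}
intake = 22.0

for name, delta in drives_running_hot(readings, intake).items():
    print(f"{name}: {delta:.1f} degrees above intake, check fans and air flow")
```

With these sample values, only the two rear drives are reported, mirroring the observation that the rear rows in a densely packed system run warmer than the front ones.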
To ensure that hard disks function correctly and last as long as possible, administrators need to continuously monitor their operating temperatures. Even though drives are designed for up to 60 degrees Celsius, it is essential to avoid this maximum value. Operation at an average of no more than 40 degrees Celsius is ideal. Ensuring that this temperature is not exceeded depends primarily on the thermal design of the system and the cooling concept of the room in which the system is accommodated.