Datacenter Too Hot? Open the Windows
In the not-so-distant past, computers and other IT equipment were very expensive. To protect that multimillion-dollar investment, everything was locked up in a ‘glass house’ with filtered air and a massive air conditioning system to keep everything running cool. Electricity for cooling was cheap and servers were very expensive. As a result, companies tended to upgrade and repair servers rather than replace them. IT managers could not afford a hardware failure because that would mean downtime and the associated loss of access to business-critical applications and data.
Now fast-forward to 2008. The price of electricity is rising as rapidly as the price of hardware is plunging. Most datacenters are now based on inexpensive industry-standard servers made from off-the-shelf commodity components. Even the most expensive processor costs only a few thousand dollars, and devices like muffin fans can be obtained for under $20. Rather than a monolithic server, groups of smaller servers are interconnected to form a computing grid, and sophisticated software moves applications around the grid as needed. Storage functions in a similar way with SANs. Instead of keeping servers and storage forever, companies replace IT equipment every five years or so with new models that are far faster and more efficient.
With all of this cheap equipment that is only used for a relatively short time, why are we still cooling datacenters? Why not turn off the air conditioning, open the windows, and just let the datacenters run? That would save a huge amount of money on electricity. Warm air could be blown outside with a fan in the summer and recycled to keep office space warm in the winter. When the question was posed to multiple IT vendors around the industry, the surprising reply came back: "We are looking into just that." The obvious concern is that the mean time between failures (MTBF) of IT equipment goes down as the temperature goes up, and few of us want our servers and storage devices to be more prone to failure. Let’s take a look at what happens when the air conditioning is shut off. Temperatures inside the box generally climb to around 40°C (104°F) and stay there, and local fans keep hot components from getting too hot. The MTBF of a server, based on the combined MTBF of all components, drops from an average of seven years to five years. But if servers are only being used for five years, that’s not bad.
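To make "the combined MTBF of all components" concrete, here is a minimal sketch of one common way to combine component figures. It assumes a simple series model: failures are independent, each component has a constant failure rate, and the server counts as failed when any one component fails. Under those assumptions the failure rates (1/MTBF) add up. The function name and every component figure below are hypothetical and purely illustrative; they are not measurements from any vendor.

```python
def system_mtbf(component_mtbfs_years):
    """Combine component MTBFs (in years) under a series model.

    Assumes independent components with constant failure rates, where
    the system fails as soon as any single component fails. In that
    case the failure rates (1 / MTBF) simply add.
    """
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_years)
    return 1.0 / total_failure_rate


# Hypothetical component MTBFs in years (illustrative only).
cooled = {"cpu": 60.0, "memory": 45.0, "power": 40.0, "disk": 30.0, "fan": 25.0}

# Hypothetical figures for the same parts running warm: every
# component gets somewhat worse, disks and fans the most.
uncooled = {"cpu": 55.0, "memory": 38.0, "power": 32.0, "disk": 20.0, "fan": 15.0}

if __name__ == "__main__":
    print(f"cooled:   {system_mtbf(cooled.values()):.1f} years")
    print(f"uncooled: {system_mtbf(uncooled.values()):.1f} years")
```

The series model is pessimistic for a real server, because backup power supplies, RAID, and memory fault tolerance let the machine keep running after a single component fails. It is still useful for seeing which components dominate the total: the ones with the shortest MTBF.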
Of the individual components, processors are the most robust at high temperatures. Virtually all processors today have over-temperature monitors that will throttle back the clock or shut down cores when a temperature threshold is exceeded. Memory DIMMs are likewise quite robust in the heat, but there will be slightly more DIMM failures. Current memory fault-tolerance methods appear to be sufficient to keep servers running when a DIMM fails, and replacement DIMMs are much cheaper to purchase three years later. Power supplies can also take the heat, and back-up supplies will take over transparently when one fails. Disk drives are an area that could see a drop in MTBF, although it may not be as severe as one would expect. It may be possible to put the drives into a SAN and provide enhanced local cooling through something like a large refrigerator. Solid state drives may be a solution for the drives within a server. There is also the option of redesigning the disk drives to push the MTBF out to 7-10 years. Like memory failures, disk failures are not catastrophic because of RAID protection. That leaves the lowly muffin fan as the most likely component to fail within the first five years of use. With redundant fans, a simple fan failure is not likely to bring down the server.
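The effect of redundancy can be sketched with a standard reliability result. Assume n identical units run in parallel, fail independently at a constant rate, are not repaired, and the function is lost only when the last unit fails. The mean time until that happens is then the single-unit MTBF multiplied by 1 + 1/2 + ... + 1/n. The function name and the 4-year fan figure are hypothetical, chosen only for illustration.

```python
def parallel_mttf(unit_mtbf_years, n_units):
    """Mean time until all n redundant units have failed.

    Assumes identical, independent units with constant failure rates,
    all running from the start, and no repair or replacement. Under
    those assumptions the result is MTBF * (1 + 1/2 + ... + 1/n).
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return unit_mtbf_years * sum(1.0 / k for k in range(1, n_units + 1))


if __name__ == "__main__":
    fan_mtbf = 4.0  # hypothetical single-fan MTBF in years
    for n in (1, 2, 3):
        print(f"{n} fan(s): {parallel_mttf(fan_mtbf, n):.2f} years")
```

Without repair, a second fan stretches the expected time to total loss by only half again. The much larger benefit in practice comes from replacing a failed fan while its partner is still running, which this simple model deliberately leaves out.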
When the numbers are finally crunched, the result shows that the energy savings from turning off the air conditioning are significant and the risk to uptime is relatively small. There is a corresponding benefit to the environment as well, because the electricity for cooling does not need to be generated. It may be possible to take existing servers and put a "service by date" sticker on them. When that date is reached, they either get pulled and discarded, or pulled and rebuilt with a few new components. Of course, for many applications such as web serving, you could just let the servers fail in place and replace them. Even mission-critical applications could be maintained in such an environment using virtualization and automatic failover. Just think about it. Servers may only require a new set of fans at 48 months to keep them humming for their entire useful life, just like servicing your car at 60,000 miles. That’s exciting news for everyone!



