When I was asked to write this article, my initial thought was to look at something big and exciting: some bad airflow management practice that caused a catastrophic failure. You, dear reader, would be entertained and come away having learned a lesson, safe in the knowledge that you wouldn’t make the same mistake yourself.
But it occurred to me that this approach wouldn’t actually be that informative. Why? Because the majority of cooling problems that data center operators face aren’t the result of some large failure in the system. Rather, they are the culmination of lots of small issues that slip through unnoticed, each too small on its own to cause any headaches, but each chipping away at the resilience and usable capacity of the cooling system. So that’s where I will focus my attention in this short text.
The following is an example of one of these small issues. It comes straight from one of our own customers and, while it’s not the most exciting, it shows how these things creep in.
The client was installing new network equipment in a facility using cold aisle containment, and using simulation to test the plans. Experience had taught them that the design of the racks and leakage paths were critical to the performance of the containment system. With that in mind, they’d gone to considerable effort to ensure that leakage paths were eliminated as far as possible. The racks had been designed with blanking and seals attached to all four sides of the front mounting rail to prevent hot air coming from the back side through to the front. The racks also sat on small 20mm legs so the blanking at the bottom extended that extra distance to cut off that particular leakage path.
For the networking infrastructure, the client had specified front-to-back breathing core and distribution switches to mitigate airflow issues. However, enterprise SAN equipment was only available in a side-to-side configuration from their preferred vendor. To house the SAN switches, the cabinets were given the appropriate modifications, with ductwork added to ensure front-to-back style operation.
Given all this attention to detail on the blanking, it was something of a surprise to see that the maximum air temperature entering one of the distribution switches was 5°C above that of the air entering the cold aisle!
Working in the virtual world allowed us to easily visualise the temperature distribution at the switch inlets, identify the location of the hot spot and trace the air supply.
For the core switches, all of the link cables needed to enter through a large cable penetration at the front of the rack. This required the mounting rail to be set back by 500mm from the front of the rack, 400mm further back than the mounting rail in the racks housing the SAN switches. This placed the exhaust of the SAN switches in front of the mounting rail in the core networking racks.
The side exhaust of the SAN switches hitting the side panel of the rack caused a local build-up of pressure. Hot exhaust air was pushed down towards the bottom of the rack, towards the gap under the side panel. Because the intake of the bottom core switch was behind this exhaust air, some was pulled under the side panel and into that core switch! Luckily, though the cause was reasonably complex, the fix was easy: simply install some extra baffling under the cabinet side panel to block the path the air was taking.
All of this action took place behind the front door of the cabinet, so the sensors mounted on the door still read the air temperature as 25°C (not surprising, seeing as they were half a meter away from the switch inlets!). The net effect was that, without the simulation, this would not have been picked up and a little error would have been allowed to creep into the system.
The fate of the distribution switch would be tied to that of the SAN switch. Any increase in outlet temperature of the SAN switch (say, due to increased throughput or increased inlet temperatures in a CRAC failure) would result in the temperature of the distribution switch also increasing.
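To see why this coupling matters, a simple two-stream mixing model helps: if some fraction of a switch’s intake air is recirculated SAN exhaust, the inlet temperature tracks the exhaust temperature linearly. The sketch below uses illustrative assumed values (a 25% recirculated fraction and nominal exhaust temperatures), not measurements from the study.

```python
# Simple two-stream mixing model for recirculation at a switch inlet.
# All figures are illustrative assumptions, not measured values.

def inlet_temp(t_cold, t_exhaust, recirc_fraction):
    """Mixed inlet temperature when a fraction of the intake air is
    recirculated exhaust (mass-weighted average, equal heat capacities)."""
    return (1 - recirc_fraction) * t_cold + recirc_fraction * t_exhaust

t_cold = 25.0  # cold-aisle supply temperature, degC
f = 0.25       # assumed fraction of intake that is recirculated exhaust

# Normal operation: SAN exhaust at 45 degC gives a 5 degC inlet rise.
print(inlet_temp(t_cold, 45.0, f))  # -> 30.0

# If the SAN exhaust warms by 10 degC (higher throughput, or warmer
# intakes during a CRAC failure), the downstream switch inlet rises
# by f * 10 = 2.5 degC along with it.
print(inlet_temp(t_cold, 55.0, f))  # -> 32.5
```

The point of the model is the linear dependence: the downstream switch’s inlet temperature is no longer set by the cold-aisle supply alone, but by whatever the SAN switch happens to be exhausting at the time.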
The real point here is that it would no longer have been in the operator’s direct control – losing control is not what you want in a mission critical facility.
This is where simulation comes into its own, letting you visualise, understand and fix these small problems before they become critical, helping you stay in control of your airflow in a way that is not possible with monitoring alone.
As an aside, there was some discussion before this happened about placing the sensors on all the racks behind the door, closer to the IT inlets. The mid-height sensor could be placed on the mounting rail itself, but the bottom one would need to be mounted on one of the side panels. In this case a sensor mounted on the left side of the rack would have picked this up, but one mounted on the right still would have missed it!

Text and images © 2015 Future Facilities.