In Search of Zombie Servers
I have a friend who moves a lot. Not because he’s a house flipper or anything like that, but because he has been so unfortunate in the matrimonial sweepstakes that he has determined it would be cheaper for him to “just meet a strange woman every four or five years and buy her a house”. As a result of these frequent changes of address, he has learned quite a bit about the moving process and is always willing to provide some pithy advice for others in the midst of packing and unpacking. He even recommends the movers from https://www.shiply.com/man-with-a-van/, which he says are the best. Perhaps his most penetrating insight into home migration is that “three moves equal one fire”. In other words, if you haven’t opened a box after loading and unloading it three times from the back of a rented U-Haul, whatever is in it isn’t of use to you anyway and should be discarded immediately. In a sense, I think this describes the current proliferation of idle servers in today’s data centers.
A recent study from Jonathan Koomey of Stanford and Jon Taylor, a partner with the Anthesis Group, found that up to 30% of all physical servers in data centers do nothing all day long, and no one notices. Although the thought of a bunch of servers on a perpetual coffee break is disturbing at any level, the 30% figure offered by the Koomey/Taylor study only confirmed the same percentage identified in earlier studies performed by McKinsey and Company in 2008 and the Uptime Institute in 2012. Coincidence? I wonder. So obviously we have a problem here, and if you’re like me, you’re probably asking, “what’s wrong with servers these days?”
As in any crisis, the first question that needs to be answered is “who is to blame?” Naturally, the arrow seems to point to everyone’s favorite whipping boy, the IT department. If these guys didn’t already exist, we’d have had to invent them. Anyway, the rationale seems to break down as follows: the IT group does not have responsibility for the electric bill, and it does a bad job of tracking the ownership of servers once they are deployed. For the more legally minded among you, this roughly translates into “ignorance of the monthly power bill is no excuse”, and since “possession is nine-tenths of the law”, the good folks who fix your email and gave you that new laptop have no one to blame but themselves.
To be fair, IT organizations do find themselves in a bit of a quandary, since a lot of servers just kind of “show up”, to support some long-forgotten trial, for example. After a couple of years, no one remembers what they actually do. This pattern of benign neglect is then exacerbated by the fact that, since no one is sure what a particular server does, the downside of turning it off to find out outweighs any potential corporate savings in the mind of the average data center technician. Let’s face it, no one is going to make you “employee of the year” for finding a zombie server that’s costing the company $100 a month, but you’ll quickly shed your corporate anonymity if your exercise in curiosity brings down the app supporting a few million in corporate revenue.
Apparently, servers that were originally installed for legitimate purposes are only part of this servers-without-a-purpose problem. Another big source of “server sprawl” is a phenomenon known as “Shadow IT”. These are the business units (and you know who you are) that go outside of regular channels to buy their own machines. There are also those dreaded boxes acquired during mergers that are never inventoried or shut down. Based on these examples, it is obvious that this metastasis is better characterized as slow and insidious than as a full-fledged frontal assault.
In researching how to combat this onslaught of servers on permanent PTO, I found the literature less than compelling. While I didn’t expect to find any articles expounding on how DCIM is the zombie server “silver bullet”, I was hoping for a little more than what I found. The primary advice offered across the board was that IT should start getting the power bill (good luck with that one) and that it should do a better job of tracking server inventory. If they had already been doing that, would anyone be talking about this? I also found the usual platitudes about server tracking needing to become a management priority. Although this makes sense, until some CIO starts asking for a monthly count of active and idle servers in the data center, I don’t think it is going to show up in anyone’s annual review criteria in the near future. And really, if anyone ever did ask and the answer was “we have everything in the Cloud”, would anyone ask for verification?
It’s amazing to me that a lot of this work still falls on deaf ears. Industry leaders may have long since addressed the issue, but one has to remember that the top ten percent are just a sliver of the total IT pie. There’s still a lot of SONET out there, for goodness’ sake! So what is one to do in this battle over which servers in the data center are pulling their weight and which are not? At the current time there appears to be no one right answer, and it’s every data center operator for themselves. How you decide to deal with this problem is up to you. If it helps, I think my matrimonially challenged friend’s advice provides you with two potential options.