Society's increasing reliance on digital infrastructure has meant that the importance of data center resiliency has grown accordingly. Whereas 20 years ago a business may have run a small percentage of its applications online, now entire business models exist only online.
Physical IT systems and the software they support need to remain connected at all times. These operations are fragile, not least because they rely on power and cooling equipment. If they fail, it can be difficult to reconnect the systems, which are often interrelated and interdependent, connected in ways that require significant manual effort to reset, reconnect, and restart. Lost data can mean lost business, which in some cases can mean a massive loss of revenue.
Systems going offline, or downtime, can lead to regulatory scrutiny for failing to provide a reliable service and, over time, can damage a company’s reputation.
It was with the idea of ensuring data center resiliency that Uptime Institute’s founder, Ken Brill, set out to create a system that would keep these fragile systems in a reliable state in the face of unreliable conditions at a utility or at a facility layer.
Brill patented the concept of a dual-corded power supply to servers, meaning IT equipment could keep running even with a loss of the primary source of power.
Together with the initial formation of the Uptime Institute, composed of data center owners and operators, he developed the Tier Standard.
In 1996, the Uptime Institute created the Topology Standard, a performance benchmarking system to help data center owners and operators identify the performance capability of their data center infrastructure. The standard defines four Tiers, each specifying the performance requirements for that level.
Importantly, the standard was outcome- and performance-based rather than prescriptive. This means operators are able to adapt their infrastructure to their own circumstances, while still using the Tier Standard as a benchmark.
Tier Standard: Topology
Tier requirements can only be satisfied by permanently installed onsite components, so operators and owners can't claim to meet Tier requirements using temporary components which may be removed later.
Also, while utility systems are viewed as economical alternatives, they are not seen as reliable systems for meeting Tier requirements, since they are outside the control of the data center owner or operator and may be interrupted at any time for any reason.
Tier I - Basic Capacity
To qualify as a Tier I data center, all that is needed is capacity and infrastructure. The critical infrastructure is non-redundant; it only has to be sized to handle the design load, and no redundant capacity components or distribution paths are required.
Typically located in office buildings, Tier I sites have an Uninterruptible Power Supply (UPS), an onsite power generation system such as an engine generator, and a cooling system.
Simply put, a Tier I data center has equipment in place to protect its IT systems from utility outages, but not from human error, and any maintenance work means infrastructure must be switched off, inevitably causing disruption.
With only enough equipment in place to support the IT design load, that equipment can help to ensure continued operation during utility outages (power, water, natural gas, etc.), but due to the lack of redundancy, the infrastructure is still vulnerable to planned and unplanned events (maintenance and equipment failures).
Tier II - Redundant Capacity Components
Tier II facilities have what is called ‘redundant capacity’. Setting aside the utility, which is inherently vulnerable, redundant capacity means that maintenance can be carried out more easily than in a Tier I site and that disruptions are less likely, thanks to the use of, for example, onsite power production, UPS modules, energy storage, and secondary cooling systems.
Tier II builds upon Tier I by requiring the addition of redundant capacity components. This can mean a redundant engine generator, UPS, or cooling unit. Tier II still allows a single distribution path. This means that Tier II facilities add the ability to isolate and maintain capacity components without impacting the critical load, but unplanned capacity component events, along with planned and unplanned events affecting the distribution system, can still result in system-wide outages.
Tier III - Concurrent Maintainability
Building on Tier II, Tier III adds the requirement of a redundant distribution path. This, coupled with the Tier II requirement of redundant capacity components, creates the concept of Concurrent Maintainability, which means any capacity component, distribution path, or item connected to any critical system can be isolated for planned events (maintenance) without impacting the critical load. This is carried through to the IT systems by the use of dual-corded IT assets or point-of-use transfer switches for single-corded IT equipment.
Though Tier III data centers are still vulnerable to downtime due to human error or unplanned outages, maintenance doesn’t require IT systems to be taken offline.
Tier III is the standard most operators hold themselves to, as it satisfies the business needs of a wide range of organizations. It provides the ability to perform all maintenance needed, whenever it is needed. Typical designs can provide some limited fault tolerance just due to the nature of how the systems are designed, although fault tolerance is not required. Additionally, Tier III design may inherently contain single points of failure.
Most data centers are built to a Tier III standard because at this level, explains Matt Stansberry, senior vice president, North America, Uptime Institute, maintenance needs to be done often, and a Tier III setup ensures that “you're not putting any of your staff at risk to do ‘hot work’ maintenance.”
“That stuff is super dangerous,” he says. “If you have a Mission Critical IT system that is designed to never go down, that responsibility shouldn't be borne by the health and safety of your staff.”
Tier IV - Fault Tolerance
Tier IV facilities are often referred to as "fault tolerant" and are designed to achieve very high expected uptime, with annual downtime measured in minutes.
Tier IV also has concurrent maintainability: it has multiple isolated systems providing redundant capacity, and multiple, active distribution paths. Any single infrastructure failure is not permitted to impact the critical environment.
Fewer data centers are certified to Tier IV, as many operators do not feel that this level of fault tolerance is required to meet their business goals.
Facilities can be certified against the Tier definitions at various levels.
Uptime Institute services are governed by the Tier Classifications and Uptime Institute is the only organization allowed to adjudicate against its standards.
Tier Certification of Design Documents (TCDD)
The Design Documents certification is ideally undertaken before or during construction, ensuring that a facility has been designed to meet a certain Tier Standard. This will ensure that a project sets off on strong foundations, avoiding what can become costly mistakes further into a data center’s development.
TCDD certification can also be undertaken at any point in the life of a data center.
Tier Certification of Constructed Facility (TCCF)
Intuitively, the Constructed Facility certification verifies whether a data center has indeed been built according to the Tier standard it was designed to meet, and whether it delivers the performance of that Tier.
During construction, any project will experience changes from the original design due to problems with constructability, value engineering or other factors. The TCCF helps to ensure that when the design is not completely followed, the result is Tier compliant and performs to the Tier objective.
History and development
Uptime started certifying against the standard in the early 2000s, focusing initially on the US and Western European markets, but now the Tier Standards are established in more than 215 countries.
The Standard volume was (and still is) available to consult for free, provided one signs a memorandum of understanding. So while it became the norm for data centers to be built to meet Tier standards, many firms didn’t go down the certification route.
It was only later, during the digital boom of the 2010s, that certifications took off globally, explains Stansberry:
“There were construction projects going on all over the world with people who might have 50 different trades on one site. They wanted to have oversight over these builds to make sure they were being done properly and meeting a certain performance standard.”
Today, the main role of Tier standards and certification is quality assurance, to catch errors in data center designs.
Such errors are inevitable, says Stansberry, because “the projects and processes are too detailed and there are too many people with different KPIs that are officially impacting a project, from cost-cutting to scheduling to permitting to just finding room in a design to get things done they want to see done.”
Despite best intentions, inevitably, he says: “Every project has mistakes. Every project has problems. We're not going through and giving people 100s, it doesn't happen. Even in the best designed, best planned data centers in the world, we’re constantly finding errors and mistakes.”
“That's just to say, these are really complex projects, so that quality assurance is really important.”
What then? Beyond Tier Topology: Operational Sustainability
To consider how a data center is actually run, Uptime created a new Tier Standard for Operational Sustainability. This is administered through services including Tier Certification of Operational Sustainability (TCOS).
The Operational Sustainability standard uses a score-based operations assessment to evaluate how effectively a data center is staffed, what its internal training looks like, and how reporting, accountability and internal organization are carried out.
Together, the two lenses of Topology and Operations are intended to help an organization design and run a facility that meets its overall business needs, ensuring that a well-designed facility is also well run.
Operational Sustainability standards also serve to compare operations across multiple data centers, and, if used as a guide, should help operators run their facilities more consistently.
There are three foundational elements within the Tier Standard for Operational Sustainability: Management and Operations, Building Characteristics, and Site Location.
Management and Operations
The ‘Management and Operations’ element is identified as the most influential in maximizing the potential availability of the data center, as, according to Uptime, most outages are attributable to management, staff and procedural shortcomings.
It looks at staffing, the effectiveness of a facility’s maintenance program including preventative maintenance and service level agreements, and how rigorous the staff training program is.
It is worth mentioning that management and operations are the most readily changed category, so attention here can be the easiest way to have a positive impact on the operation of the facility.
Building Characteristics

This element analyzes the practicalities, checking how thorough an operator’s commissioning program is and how well documented it is, covering, for instance, factory witness testing of critical infrastructure.
It also looks at building features, with a view to assessing how well they set the operations staff up for success.
For instance, this element looks at whether a facility has dedicated space for disaster recovery, for meetings and training, and for tool storage.
Site Location

Lastly, the site location element assesses the risk of natural disasters such as flooding or seismic activity, as well as man-made risks, such as proximity to a chemical plant or a civil or military airfield, and how well prepared an operator is to mitigate those risks.
Data center risk assessments (DCRA)
To assess a live, operating data center, Uptime Institute offers a service that combines the Tier Standard: Topology and the Tier Standard: Operational Sustainability.
Data center risk assessments are intended to assess the capabilities and condition of an existing data center, and how well it can serve the needs of the organization.
It can be used to identify shortcomings, as a due diligence item, or as part of a program to improve the facility's operations.
The assessment looks for vulnerabilities and single points of failure, assesses a data center’s capacity, arrangement and the condition of the equipment in situ, as well as a facility’s maintenance regime, staffing, and training programs.