As is the case with any software, tools aimed at data center operation and management are directly dependent on the data used to describe both areas. The "old" rule of IT, "garbage in, garbage out," still applies.

If, as we do at TycheTools, we consider the basic goals (mission-critical operation, efficiency, and sustainability), we find that systems of this complexity require working with a much larger volume of data than has been used until now.

To ensure that this large volume of data unleashes its tremendous potential, the tools must be specifically designed to cover a series of aspects that we want to share through this white paper, based on our own experience in building ECOaaS.

How much data?

Today, we describe four areas that are sources of data for the O&M of an IT technical room: cooling, electricity, IT, and thermal management. These areas are interconnected, which means that a larger amount of data provides a better description of the relationships than a smaller amount of data.

For example, one data point per day, let's say at 00:00, for the return temperature on a cooling unit, along with one data point per day at the same time for the power of that same unit, does not describe the operation of that unit in the same way that 500 temperature data points and 500 power data points would.

If we use an arbitrary rate of one data point every 10 minutes, we will have 6 × 24 × 2 = 288 data points in a day (144 for return temperature and 144 for power in W or kW). However, return temperature alone is not enough: its relationship with power is non-linear, and the pair does not provide the elements needed to describe the operation of the four interconnected areas.

For example, T inlet (the temperature at which cooled air enters the IT cabinets) provides more useful information. So, we could add another 144 data points for each T inlet sensor, assuming one T inlet sensor for each cabinet in a room with 20 cabinets.

[Figure: ECOaaS heat maps]

In total, 144 × 20 = 2,880 data points. But since T inlet depends, among other things, on T supply (the temperature at which the cooling unit supplies cold air), we add one T supply sensor (144 data points). And since this operation is mission-critical, it is not reasonable to assume there is only one cooling unit; there will be at least two. So we have 144 data points per day for each sensor, with three measured variables (T return, T supply, power) per cooling unit and one per cabinet (T inlet). In total, 864 data points for the cooling units and 2,880 for the cabinets: 3,744.

But, in reality, this still describes only a small part of the operation; it doesn't tell us much about IT except for T inlet, and that is not very accurate. If we follow ASHRAE recommendations, we should actually describe T inlet with three sensors per cabinet (two more per cabinet, thus 2 × 20 × 144 = 5,760 additional data points).

Since what we're most interested in is describing the operation of IT (the only reason we build these rooms), we need more data, either through power measurements (via PDUs) or indirectly (which requires T outlet sensors, i.e., the temperature at which the air exits after passing through the IT equipment).

If we use the first method, we have to assume that each cabinet has two power branches, so we add 40 sensors (another 5,760 data points). Or we can assume there is no power measurement in the cabinets, in which case we use another 60 T outlet sensors (three per cabinet) to describe IT operation, meaning another 8,640 data points.

This means that, taking the T outlet route, we have a total of 18,144 data points per day to describe the temperature and power variables of a room with 20 cabinets. If we wanted one data point every five minutes (the sampling interval is what we call "granularity" at TycheTools), it would be 36,288.

But there are other variables we need to control; temperature and power are not enough. We also need to know what happens with relative humidity (RH), and we want to use another variable that allows us to describe the airflow indirectly, which is critical in operational terms. So we would be talking about 108,864 data points per day for a room with 20 cabinets and a 10-minute granularity. Or, in other words, 39,735,360 data points per year.
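The running tally above can be reproduced in a few lines of arithmetic. The sketch below assumes the sensor inventory of the worked example (two cooling units with T return, T supply, and power; twenty cabinets with three T inlet and three T outlet sensors each) and simply scales the stated full daily total by 365 for the annual figure; it is an illustration of the arithmetic, not part of ECOaaS.

```python
# Data-volume tally for the 20-cabinet worked example (assumed sensor inventory).
SAMPLES_PER_DAY_10MIN = 24 * 60 // 10   # 144 samples per stream per day
SAMPLES_PER_DAY_5MIN = 24 * 60 // 5     # 288 samples per stream per day

cooling_streams = 2 * 3   # 2 cooling units x (T return, T supply, power)
inlet_streams = 20 * 3    # 20 cabinets x 3 T inlet sensors (ASHRAE placement)
outlet_streams = 20 * 3   # 20 cabinets x 3 T outlet sensors (indirect IT description)

temp_and_power_streams = cooling_streams + inlet_streams + outlet_streams   # 126

daily_10min = temp_and_power_streams * SAMPLES_PER_DAY_10MIN   # 18,144
daily_5min = temp_and_power_streams * SAMPLES_PER_DAY_5MIN     # 36,288

# Adding RH and an airflow-proxy variable takes the daily total to 108,864.
full_daily_total = 108_864
annual_total = full_daily_total * 365                          # 39,735,360

print(daily_10min, daily_5min, annual_total)
```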

And this does not include the fact that, once we have the data for the basic variables, we want to derive from it other important O&M variables, both from a mission-critical and an efficiency standpoint, such as PUE. The volume of data calculated from the "measured" data is actually much larger.
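As a simple illustration of how calculated variables sit on top of measured ones, the sketch below derives PUE (total facility power divided by IT power) from two power streams; the numbers are invented for the example.

```python
# Minimal sketch: deriving PUE from measured power streams (values are illustrative).
facility_kw = [92.0, 95.5, 90.8, 97.2]   # total facility power samples, kW
it_kw = [61.0, 63.4, 60.2, 64.8]         # IT load samples at the same timestamps, kW

# PUE = total facility power / IT power, per sample and over the whole period.
pue_per_sample = [f / i for f, i in zip(facility_kw, it_kw)]
period_pue = sum(facility_kw) / sum(it_kw)

print([round(p, 2) for p in pue_per_sample], round(period_pue, 2))
```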

This is just the beginning. In reality, the operational data we have described covers only a very basic part of the operation. For example, the operational data generated by IT itself is orders of magnitude larger than anything described here; it already exists, although it is only sporadically used. Describing the O&M of a critical IT room therefore requires much more data than we have considered so far.

Quality data?

Since the description of the operation and management of a data center is a structured construction, with measured data at the base and calculated data derived from it, it is evident that the quality of the base data is critical for the description to be sufficiently accurate. But the "quality" of the data is based on three different axes that must be taken into account.

First, the "quality" of the measurement itself. Obviously, a margin of +-10 percent is much worse than a margin of +-0.2 percent.

Furthermore, there are variables whose error margins are themselves variable; for example, the power measurements commonly used in rPDUs have different error margins depending on the power being measured: in general, lower powers carry much larger relative error margins.
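A common way of expressing this is an accuracy specification of "percent of reading plus a fixed offset". The figures below are hypothetical, not those of any particular rPDU, but they show how the relative error of a power measurement grows sharply at low loads.

```python
# Hypothetical rPDU accuracy model: +/-1% of reading plus a fixed +/-5 W offset.
PCT_OF_READING = 0.01
FIXED_OFFSET_W = 5.0

def measurement_uncertainty(power_w: float) -> tuple[float, float]:
    """Return (absolute uncertainty in W, relative uncertainty in %)."""
    absolute = PCT_OF_READING * power_w + FIXED_OFFSET_W
    relative_pct = 100.0 * absolute / power_w
    return absolute, relative_pct

for load_w in (50.0, 500.0, 5000.0):
    abs_err, rel_err = measurement_uncertainty(load_w)
    print(f"{load_w:7.0f} W -> +/-{abs_err:5.1f} W  (+/-{rel_err:4.1f} %)")
```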

There is also a time axis to consider: sensors must maintain the accuracy of their measurements over a lifespan of many years, which implies that we must somehow ensure control and recalibration procedures, or data quality will degrade over time.

Finally, there is the axis of the coherence of the dataset as a whole. With so many sources producing data continuously, it is reasonable to expect transmission errors in some cases. Any strategy for working with this volume must therefore include real-time "validation", so that "doubtful" data is not fed into the calculation of higher-level variables, where a cascade effect would amplify the inaccuracy of the description of the operation.
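A minimal sketch of such a validation gate, assuming illustrative per-variable plausibility ranges and a maximum step between consecutive samples: anything that fails a check is flagged as doubtful and kept out of higher-level calculations instead of being propagated.

```python
# Minimal real-time validation gate for incoming samples (thresholds are illustrative).
PLAUSIBLE_RANGE = {"t_inlet_c": (10.0, 45.0), "power_w": (0.0, 20000.0)}
MAX_STEP = {"t_inlet_c": 5.0, "power_w": 3000.0}  # max change between consecutive samples

last_good: dict[str, float] = {}

def validate(variable: str, value: float) -> bool:
    """Return True if the sample looks plausible; doubtful samples are rejected."""
    low, high = PLAUSIBLE_RANGE[variable]
    if not (low <= value <= high):
        return False
    previous = last_good.get(variable)
    if previous is not None and abs(value - previous) > MAX_STEP[variable]:
        return False
    last_good[variable] = value
    return True

print(validate("t_inlet_c", 22.5))   # True: plausible
print(validate("t_inlet_c", 60.0))   # False: out of physical range
print(validate("t_inlet_c", 30.0))   # False: implausible 7.5 C jump in one interval
```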

Usable data?

O&M data is used in two different directions. On one hand, for the actual operation: we want our room to operate safely, efficiently, and sustainably. On the other hand, there are management elements that must be covered: improvement plans, technology changes, system architecture changes, or, as is beginning to happen now, changes in regulations related to technical room operations that also increase organizations' compliance needs.

But the enormous volume of data generated in the O&M of a room also poses a challenge from the data usability perspective; it's great to have millions of data points, but how are they used? When trying to visualize the data in some way, even in smaller volumes, several problems arise.

First, although some form of aggregation is commonly used (for example, daily values, typically averages of each variable's measurements), it must be noted that, while useful in certain cases, aggregations hide reality.

It must be kept in mind that in operations, it is critical to evaluate multiple interrelated variables and react immediately, which is impossible to do if aggregations are used, as all aggregations lose sight of the temporal evolution of the variables. Although aggregations are widely used, they are actually more elements of management than of operation.
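A small illustration with invented numbers shows the point: a daily average of inlet temperature can sit comfortably below an alarm threshold while the underlying 10-minute samples contain an excursion an operator would have to react to.

```python
# Illustration: a daily mean hides a short thermal excursion (values are invented).
ALARM_C = 27.0  # assumed inlet-temperature alarm threshold

# 10-minute inlet temperature samples: a stable day with one short spike.
samples = [22.0] * 140 + [28.5, 29.0, 28.2, 22.0]

daily_mean = sum(samples) / len(samples)
breaches = [t for t in samples if t > ALARM_C]

print(f"daily mean = {daily_mean:.1f} C")              # well below the threshold
print(f"samples above {ALARM_C} C: {len(breaches)}")   # the excursion is still there
```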

On the other hand, a significant problem arises from the point of view of the dataset to be displayed. If the time scale is large, multiple data points end up practically overlapping, making fine analysis impossible.

If the time scale is short, the ability to identify recurring elements that are of great value in operations (e.g., IT loads that increase during certain periods that could be optimized or that could pose a risk to room operation) is lost.

This requires the construction of visualization windows with enormous flexibility, but it is obvious that the more data points we use, the more calculations we require, which can lead to unacceptable response times.
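One common compromise, sketched below under the assumption of a fixed "pixel budget" per chart, is to keep the minimum and maximum of each window instead of its average, so that peaks remain visible even when hundreds of thousands of points are reduced to a few hundred.

```python
# Min/max downsampling: reduce a long series to a fixed plotting budget without losing peaks.
def downsample_min_max(values: list[float], buckets: int) -> list[tuple[float, float]]:
    """Split the series into roughly `buckets` windows and keep (min, max) of each."""
    size = max(1, len(values) // buckets)
    envelope = []
    for start in range(0, len(values), size):
        window = values[start:start + size]
        envelope.append((min(window), max(window)))
    return envelope

# Example: one week of 10-minute samples (1,008 points) reduced to ~50 windows.
week = [22.0 + 0.001 * i for i in range(1008)]
week[700] = 29.5  # a single short spike
print(len(downsample_min_max(week, 50)))                       # ~50 (min, max) pairs
print(max(high for _, high in downsample_min_max(week, 50)))   # the spike survives: 29.5
```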

Finally, with so many observable data sources, the problem arises of which sources to select to analyze a particular behavior; you cannot use all of them at once, because the "trees" (individual measurements) do not let us see "the forest" (the operation).

Since the operation must also analyze the interrelationships between the different variables, there is an additional problem in determining which "trees" are to be selected.

Because of all the above, the existence of multiple interrelated variables means that it is impossible, aside from being tremendously inefficient, for a human operator to monitor room operation through an interface 24 hours a day.

Although the need for a flexible interface is undeniable, the reality is that the volume of continuously changing data means that any system aiming to be minimally efficient must work primarily in the background, automatically, without operator intervention, except for issues the system cannot solve on its own or when a configuration change is desired.

In every sense, it is an "autopilot" system. The "messaging" layer communicates the situation to the operator and, if necessary, the need for intervention.
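A minimal sketch of this background-first pattern, with invented thresholds and a print statement standing in for the messaging layer: the loop evaluates incoming data continuously and only surfaces a message when operator intervention is required.

```python
# Background "autopilot" loop sketch: evaluate continuously, notify only on exceptions.
def evaluate(sample: dict) -> str | None:
    """Return a message if the operator needs to act, otherwise None."""
    if sample["t_inlet_c"] > 27.0:          # assumed intervention threshold
        return f"Inlet temperature {sample['t_inlet_c']} C exceeds 27 C in {sample['cabinet']}"
    return None

def notify(message: str) -> None:
    print(f"[ALERT] {message}")             # placeholder for the real messaging layer

incoming = [
    {"cabinet": "R07", "t_inlet_c": 22.4},
    {"cabinet": "R12", "t_inlet_c": 28.1},  # only this one reaches the operator
]
for sample in incoming:
    msg = evaluate(sample)
    if msg is not None:
        notify(msg)
```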

Universal data?

Operating multiple rooms in multiple locations raises the issue of data coherence on a larger scale. Both in extensive infrastructures, such as a telecommunications network or railway infrastructure, and in distributed IT operations using multiple cloud centers, it is predictable that the existing data sources will be diverse; that is, different rooms will have different systems supplying data.

However, this data must be mutually consistent; for example, power data must have similar or identical error margins so that the totals are not a sum of "apples and oranges," and all operations must supply data.

To achieve "universality" of the data, it is essential to obtain data through calculations using indirect variables, that do not depend on any equipment details and that allow controlling the different operations of each room to be managed consistently.

This also allows all operations (all rooms) to share the same consistent database, regardless of how, when, and where they were built, their size, and the equipment and technologies they use.

Without this ability to use indirect variables and therefore universalize the data, it is practically impossible to consider a coherent description of the operations of extensive IT infrastructures. Without data coherence, there is no reliable description of operations, and therefore management becomes practically impossible.

Conclusions based on data

The operation and management of mission-critical technical rooms require a volume of measured, interconnected, and extrapolated data that goes far beyond what has been used until now.

This enormous volume of data means that the tools used to unleash the tremendous power it contains must be specifically designed to generate and manage these volumes as effectively as possible, from every perspective. The development of our ECOaaS O&M solution has led us to understand these problems and to solve them.

This is the data challenge.