You will have heard of Blackrock, the world’s largest investment house. The company operates in 90 locations in 70 cities across 30 countries. It controls an investment pile of over US$4 trillion on behalf of its clients. It also provides the technology platform for a further $9 trillion of funds that are managed by other asset houses. It is the job of Blackrock to measure risk on behalf of its investor clients and ensure it delivers the best possible return on their investments. For this it relies on technology. Or more specifically it relies on the 25m lines of code of its Aladdin software platform.
That technology platform resides in a series of data centers dotted around the US and the world serving 3,000 analysts and 19,000 users. The firm’s dealers and analysts measure risk in the markets. The company itself measures risk in its data centers.
The global financial markets are characterized by speed. Instant information, faster data feeds, lower latency and, ultimately, immediate returns.
So it is not without irony that inside Blackrock the obsession with speed is decried – “we’re not a latency play” – amid acknowledgement that the technology underpinning its operations – right down to its UPS – is changing rapidly. This led Herb Tracy, the firm’s global head of critical engineering, to say to FOCUS: “Today I wouldn’t design a data center the way I would have even three months ago.”
For an industry that until five years ago hadn’t changed much in decades, that’s quite a statement. For an engineer whose job is to guarantee uptime for a company in a sector where risk is everything, it is an acknowledgement of the challenges he faces in criticality, sustainability and efficiency. People, said Tracy, are products of their experiences. If someone (an engineer) did something 20 years ago and it worked, then there was a natural reluctance to change. This is no longer the case.
The cost of energy is the big line item. It is about keeping the mechanical chillers switched off for as long as possible. Last year Blackrock used mechanical cooling for fewer than 40 hours at its West Coast data center.
“We don’t just go out there and measure risk. We manage risk from the beginning stages. In data center terms, measuring risk starts with site selection. The due diligence starts with the routine checks for gas lines, railroad tracks, highways and flight paths. The geography is checked for stability and flood risk. We do a lot of diligence to reduce risk. We don’t build a Tier III or Tier IV level. We evaluate and build the required resiliency for the applications that will be hosted in that particular facility,” Tracy said.
Security of power supply is an obvious requirement but what may not be as obvious is a shift to ‘green’ energy. “Recently we’ve tried to stay away from nuclear- and coal-powered data centers. Sustainability is vital to our business and the communities in which we work and live. We manage a lot of money on behalf of our clients and we host many large money owners as third-party clients on our Aladdin software platform (Blackrock manages $4.3 trillion in assets under management directly and another $14 trillion flows through the Aladdin platform and therefore through our data centers). In addition to being an investment manager Blackrock Solutions is a business division which delivers technology services to other large money owners, so we get a lot of questions about sustainability and we take this responsibility very seriously.”
In engineering risk terms, at first glance Blackrock appears to deliver a fairly standard resilience setup.
Tracy set out the minimum standard: “We are N+1 on resiliency. We generally have two independent power trains and are N+1 on cooling. We have two independent power trains which have their own generators and UPS. Our independent emergency generator and UPS can handle half of the load out on the A and B cord. We could lose an entire generator line and still have ample cover to run the full operation.”
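One way to read the arrangement Tracy describes is as a capacity check: in normal running each power train carries roughly half of the load, and the site rides through the loss of one train only if the remaining train can carry all of it. The Python sketch below illustrates that check. The function name and every kilowatt figure are hypothetical, chosen for illustration, and are not Blackrock's numbers.

```python
def survives_loss_of_any_train(train_capacities_kw, total_load_kw):
    """Return True if the full load can still be carried after losing any one train."""
    if len(train_capacities_kw) < 2:
        # With a single train there is nothing left to fall back on.
        return False
    total_capacity_kw = sum(train_capacities_kw)
    # Remove each train in turn and check the remaining capacity against the load.
    return all(
        total_capacity_kw - capacity_kw >= total_load_kw
        for capacity_kw in train_capacities_kw
    )


# Hypothetical figures for illustration only.
a_train_kw = 2500.0
b_train_kw = 2500.0

print(survives_loss_of_any_train([a_train_kw, b_train_kw], 2000.0))  # True
print(survives_loss_of_any_train([a_train_kw, b_train_kw], 3000.0))  # False
```

In the second call the load exceeds what a single train can carry, so losing either train would drop part of the load.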
After the company bought Barclays Global Investors, in a deal that closed in December 2009, it had 28 data centers around the world. It then began a migration to a single platform. Today BlackRock has 11 data centers and the plan is to get to six or eight at some point.
As this consolidation project continues the fleet consists of a mixture of owned and operated facilities and wholesale colocation space (such as at the Sabey campus in Wenatchee, Washington State, pictured).
“On our US team, we have very strong engineering and IT skills which allow most of our data centers to be owner-operated. In EMEA and Asia, we generally set up data centers so that production is colocated and disaster recovery is in owned and operated sites. Our relationship with Sabey is a wholesale lease,” Tracy said. Anything that touches the business is controlled by Blackrock. It maintains a full-time staff on the site.
Being in a multi-tenant facility, Blackrock built out its presence at Sabey in pods. To give an idea of the scale of the operation, in 2010 Sabey and Blackrock both sought permission for three 2.5MW diesel-fired generators for a portion of the site, where Sabey operates a data center adjacent to the Blackrock facility. VMware, which is inside the same shell, already had 10 diesel-fired generators permitted (2MW each), with only three then installed. This is alongside the existing T-Mobile data center, located in the adjacent building on an adjacent parcel, which gained permission to install and operate up to 20 diesel-fired generators (2MW each).
“Using indirect evaporative cooling and UPSs with the highest efficiencies we have been able to run Wenatchee at 1.18 PUE (power usage effectiveness). We are better than industry average now and we think we can beat 1.1,” Tracy said.
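PUE is the ratio of the total power a facility draws to the power that reaches the IT equipment, so a PUE of 1.18 means 0.18kW of overhead (cooling, power conversion losses, lighting) for every 1kW of IT load. The Python sketch below shows the arithmetic. The 1,000kW IT load and the function names are assumptions for illustration, not figures from Blackrock.

```python
def pue(total_facility_kw, it_load_kw):
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def overhead_kw(it_load_kw, pue_value):
    """Non-IT power (cooling, conversion losses and so on) implied by a PUE."""
    return it_load_kw * (pue_value - 1.0)


# Hypothetical IT load for illustration only.
it_load_kw = 1000.0

for target in (1.18, 1.10):
    print(f"PUE {target:.2f}: about {overhead_kw(it_load_kw, target):.0f} kW of overhead")
```

On these assumed figures, moving from 1.18 to 1.10 would trim overhead from about 180kW to about 100kW for the same IT load.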
Blackrock is at pains to point out that compared with other financial institutions, where the tendency is to run thousands of applications, it tries to move everything to a single platform. “We have one main application platform (Aladdin) and a couple of other supporting ones,” Tracy said.
Generally production is in one location and disaster recovery runs concurrently in another. East Coast production might flip to the West Coast – “we have that ability”. The traders and portfolio managers never notice where the workload is running.
In terms of risks associated with the public cloud, nothing is discounted without reason. But Blackrock has a rule that there will be no client information put onto a public cloud. The public cloud might, however, be used for some development work.
How the team is structured
Tracy’s team is built on a global functional matrix. Regional managers are appointed around the world and the team uses third-party engineering companies.
“We elected to combine MEP and IT teams so that they are part of one group in our East Wenatchee data center. In a traditional data center the MEP team is responsible for the data center envelope and then hands off to the technology teams. Now we all report to the same business head, the COO. Because of that we avoid any facilities/IT conflicts that can be typical in other organizations,” Tracy said. “The relationship with IT is very collaborative and works very well. How and where we put IT kit in the facility is discussed well in advance of any decisions being made. People with mechanical expertise not only learn the core functions of their own responsibilities but are tasked with being multi-disciplinary. Everyone is required to learn all the skills. For example, we have a switchgear specialist who is required to run fiber and patch servers.”
On construction projects the close control that Blackrock maintains also points to its understanding of where the risk should lie.
“We’re not the type of firm that issues specifications, seeks bids, hires a design engineering firm to build our data centers. We build the preliminary design, we select the equipment. We retain influence, both direct and indirect, over the choice of equipment and oversee the entire process,” Tracy said.
“Now we want to know what is the total maintenance cost for the 15-year life span of the equipment. Lower upfront costs are not the key investment driver. The boxes themselves are more resilient. In the old days it was about screwdriver tweaking the system. Now boxes are digital and remotely controlled.”
The criticality of M&E is neither going up nor down, but the technology is changing rapidly at both the mechanical and electrical level and the IT architecture level.
Tracy offered the example of efficiency at various loads on a UPS system. Five years ago the best results were 92% efficiency at 100% load. Now “we have 97% efficiency at 100% double conversion and it doesn’t drop off until below 20% load,” he said.
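To put those percentages in context, the power a UPS loses as heat is its input power minus its output power, where input equals output divided by efficiency. The Python sketch below compares the two efficiency figures Tracy quotes. The 1,000kW output load and the function name are assumptions for illustration only.

```python
def ups_loss_kw(output_kw, efficiency):
    """Power dissipated in the UPS when delivering output_kw at a given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    input_kw = output_kw / efficiency
    return input_kw - output_kw


# Hypothetical output load for illustration only.
output_kw = 1000.0

loss_at_92 = ups_loss_kw(output_kw, 0.92)
loss_at_97 = ups_loss_kw(output_kw, 0.97)

print(f"Loss at 92% efficiency: {loss_at_92:.1f} kW")  # 87.0 kW
print(f"Loss at 97% efficiency: {loss_at_97:.1f} kW")  # 30.9 kW
```

On this assumed load, the five-point efficiency gain cuts UPS losses by roughly two thirds, and that lost power would otherwise also have to be removed by the cooling system.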
The Wenatchee site in 2010
Like any other part of any financial services organization, Blackrock’s data center operations are subject to external auditing. Maintenance records, repair records and alarm systems are checked regularly. These checks cover topics such as how individual alarms are communicated, the initial response and how incidents get closed out.
The company has an in-house designed ‘data center infrastructure management’ (DCIM) system and Tracy is the man on the frontline. His is the phone that goes off should a problem arise. “Every alarm in the world eventually goes to me after the local teams are engaged,” he said. “If there is something that could potentially impact our availability I will be informed immediately.”
This article first appeared in FOCUS issue 34.