For the last few weeks, it has been the worst-kept secret in all of computing, partly because Intel gave the world its product roadmap as early as last January, but in larger measure because HP, Dell, and Cisco have announced their new datacenter servers over the past two weeks. Some even mentioned the name “Intel E5-26xx v3” in their product brochures, without actually having clearance from Intel to say more.

Now the thinnest veil we’ve ever seen has been lifted, and we can talk about the engine that will power those servers.

“This is our workhorse mainstream product. When we thought about designing it, we thought about performance, cost, and energy efficiency,” stated Eoin McConnell (pictured right), Intel’s product line director for the E5 v3 series, up to now code-named “Grantley-EP.”

“It’s certainly going to enable us to start delivering our vision defined last year around software-defined infrastructure.”

Version 3 of E5 takes an incremental, though not monumental, step forward overall. But in certain departments, it makes some long-awaited breakthroughs. Considered as a whole, the result is that partners such as HP, Dell, and Cisco can now produce classes and form factors of server that they could not produce with E5 v2, and put their brands on servers with more distinctive, perhaps even exclusive, features than v2 could have delivered.

As with every other incremental improvement Intel makes (the “tock” in its hugely successful “tick-tock” development cadence), E5 v3 has its share of “faster,” “bigger,” and “lower-power” improvements. When you take these improvements into account collectively, v3 may cross a threshold that could change the game for datacenters in 2015.

Support for DDR4 memory
“We’re going to have the first server platform with DDR4 memory,” proclaimed Intel’s McConnell. “We’re working collaboratively with this memory with a lot of our customers, and we’re going to ensure we’ll drive this transition together.” More than any other single advance, the fact that the Xeon E5-2600 v3 series supports the newest form factor of enterprise-class memory module will change the way servers look, act, and work. The DDR4 class has been waiting in the wings for years, having been pioneered and even produced by Samsung.

- DDR4’s maximum component density has doubled over DDR3, to 8 Gb.

- DDR4’s bit rate has been boosted by 50% over DDR3 to 3.2 Gbps.

- DDR4’s standard voltage has been lowered from 1.5V to 1.2V.

DDR3’s data rate has been capped at 1866 MHz since early 2011, although 1600 MHz modules have remained prevalent. Meanwhile, DDR4 modules have been in production for over two years, with data rates of 2133 MHz.

The “Advanced series” of E5 v3 will support 2133 MHz out of the gate, with five models starting with the 10-core, 2.3 GHz E5-2650 v3, working up to the 12-core, 2.6 GHz E5-2690 v3. Nearly all of the “segment-optimized” series (high-performance, high-frequency, and workstation), except for the 4-core models, support DDR4-2133, including the high-frequency 6-core, 3.4 GHz E5-2643 v3, and the top-of-the-line 18-core, 2.3 GHz E5-2699 v3.
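For a rough sense of what those speeds buy, here is a back-of-the-envelope sketch of theoretical peak memory bandwidth. It assumes a 64-bit (8-byte) memory channel and four channels per socket; the channel count is an assumption for illustration, and real sustained throughput always lands well below the theoretical peak.

```python
# Back-of-the-envelope peak memory bandwidth for DDR3-1866 vs. DDR4-2133/3200.
# Assumes a 64-bit (8-byte) channel and four channels per socket; sustained
# throughput in practice is well below this theoretical peak.
BYTES_PER_TRANSFER = 8      # 64-bit memory channel
CHANNELS_PER_SOCKET = 4     # assumed channel count, for illustration only

def peak_bandwidth_gbs(data_rate_mts, channels=CHANNELS_PER_SOCKET):
    """Theoretical peak bandwidth in GB/s for a given data rate in MT/s."""
    return data_rate_mts * 1_000_000 * BYTES_PER_TRANSFER * channels / 1e9

for name, rate in [("DDR3-1866", 1866), ("DDR4-2133", 2133), ("DDR4-3200", 3200)]:
    print(f"{name}: ~{peak_bandwidth_gbs(rate):.0f} GB/s per socket, theoretical peak")
```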

Every advancement in server memory architecture in the last three years has been designed for DDR4.  But until today, there hasn’t been an Intel x86 server processor that would support it.

“As most people know, the memory vendors have stopped investing in DDR3 in terms of new capabilities,” admitted McConnell.  “DDR3 caps out at an 1866 [MHz] memory bandwidth speed.  So we’re excited about the new bandwidth improvements we can get with DDR4, and we’re also looking at significant power efficiencies.”

A network of 18 cores
We’ve known since the onset of the multicore era that processor cores are not linearly scalable. After a certain number — which engineers in the last decade reliably pinned at 8 — simply stacking more cores onto a die could either yield no performance improvement or actually impose a performance cost.
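The standard way to see why is Amdahl’s law: any serial fraction of a workload puts a hard ceiling on how much additional cores can help. The sketch below is a generic illustration of that diminishing-returns curve, not a model of any particular Xeon; the 10% serial fraction is an arbitrary assumption.

```python
# Amdahl's law: speedup(n) = 1 / (serial + (1 - serial) / n)
# A generic illustration of why simply stacking cores flattens out once any
# serial work (or shared-resource contention) is present.
def amdahl_speedup(cores, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (4, 8, 12, 18):
    s = amdahl_speedup(cores, serial_fraction=0.10)  # 10% serial, assumed
    print(f"{cores:2d} cores -> {s:.2f}x speedup")
```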

So linear scalability is not the answer long-term. For Intel to make a processor more capable by stacking cores, it has had to change the architecture with which cores are stacked.  With E5 v3, that change runs deeper than you might have expected.

Although we call Xeon E5 v3 an “18-core” processor, to be truthful, there’s just one model (“SKU”) where all 18 are present: the 2.3 GHz, 145W TDP E5-2699 v3. Nevertheless, each of the die configurations in the v3 series is based on an 18-core design that may be scaled down to lower core counts.

The 18-core, 2.3 GHz Xeon E5-2699 v3 is arranged somewhat differently from the 12-core, 2.7 GHz Xeon E5-2697 v2 (no, that’s no typo, the newer model has the slower clock speed). Without rethinking the microarchitecture of the core itself, Intel has changed the on-die interconnects: the way the cores, the cache memory, and the memory controllers interact. In early tests, some of which this reporter has witnessed first-hand, the performance gains appear appreciable, slightly outdistancing the levels of previous “tock” improvements.

DatacenterDynamics will present a special article devoted to just this topic this week, because it is the beginning of a directional shift in the entire concept of multicore, away from linear stacking.  How you assess the performance requirements for your datacenter servers will change from this point forward, because one 18-core processor is not the same as one-and-a-half 12-core processors.

For now, here’s a summary: With E5 v3, Xeon processors with 8 or more cores are moving to an interconnect architecture based on twin buffered rings. Think of them like a pair of superhighway loops, one of which connects with the east side of “downtown” and one with the west. Along the way, they connect with the last-level cache (LLC), two memory controllers and their two home agents (HA), and on-ramps to the QuickPath Interconnect (QPI) links and the PCIe peripheral bus.

“Part of it is about trying to balance your resources, and feed the beasts,” admitted Chris Gianos (pictured right), one of Intel’s lead engineers for Xeon E5 v3.  “So with the fully-buffered rings, we can feed more bandwidth to each core.  The previous microarchitecture was going to run out of steam at the higher core counts.  This gets us to 18, and will serve us [going forward].”

Per-core power management
The dream of every datacenter manager is fully addressable power regulation, which for cloud scenarios means controlling power consumption anywhere on the planet. What is changing is the meaning of “fully” in this context.

One of Intel Core Microarchitecture’s superb advancements to date has been its variable p-states — the ability to trust the processor to reduce its own power levels when its workloads are closer to idle. “Idle” has always been something of a misnomer; processors always process something, and when they’re officially processing nothing, an “idle process” gives them something to do.

Lowering the processor’s p-state means it can do this “thumb twiddling” process with reduced power, at the very least. Already, variable p-states have changed the game for load balancing, enabling power-savings payoffs when workloads are distributed among CPUs or among servers.

Beginning with E5 v3, this distribution capability extends to a much more granular level, with per-core p-states (PCPS).

“Depending upon which cores are actually being tasked at a particular point in time by the workload or the application, we’ve got the ability where all of the cores don’t have to operate at the same power level,” stated Intel’s McConnell. Effectively, each of the 18 cores in E5 v3 operates in its own individually variable voltage domain.
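One way to observe the effect of per-core scaling in practice is through Linux’s cpufreq sysfs interface, which reports each logical CPU’s current operating frequency. The sketch below assumes a Linux host with cpufreq enabled; it merely shows that cores need not run in lockstep, and is not an interface to the PCPS hardware itself.

```python
# Minimal sketch: read each logical CPU's current frequency from Linux cpufreq.
# Assumes the cpufreq sysfs interface is present; exact paths and governors
# vary by kernel and distribution. Observation only; not a PCPS control API.
import glob

for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
    cpu = path.split("/")[5]              # e.g. "cpu0"
    with open(path) as f:
        khz = int(f.read().strip())
    print(f"{cpu}: {khz / 1_000_000:.2f} GHz")
```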

Ever since the days of Pentium Pro, Intel has placed some features of the processor platform onto the package, sometimes before they’re moved to placement on-die. With E5 v3, the voltage regulators are moving on-die. This actually raises the thermal design power (TDP) of the processor a bit (the general measure of how much heat its cooling system must dissipate), from 130W in the mid-range of the series to 145W, with workstation-specific models jumping from 150W to 160W.

Intel’s engineers describe this as something of a trade-off: the price of moving more resources on-die is a higher TDP rating, but those resources now share power, and ultimately, they say, use it more conservatively.

“We look at power efficiency and the ability to best utilize our power,” said Intel’s Gianos.  “A lot of the resources on the platform are now shared.  How are we best sharing the power that we have across all the devices — the cores, the memory system, the platform?  I think integrating the voltage regulator in Haswell is a way of better utilizing shared resources.”

On-die core clustering
Virtualization added a new dimension to multitasking. Years ago, it was the operating system’s job to schedule the time and resources apportioned to specific tasks. Today, there’s at least one layer of added abstraction, so user applications are now contained within virtual machines. It’s the job of the VM management platform, or the orchestration layer (a term that’s just now catching on), to coordinate how much of the CPU’s attention VMs are allowed to have.

Because of this, one aspect of computing is more applicable today than even a decade ago: certain tasks tend not to require a whole lot of cores to themselves. For high-performance tasks, parallelizing compilers can break processes down further, into smaller nuggets. So for an orchestration layer to actually make use of all 18 cores in the proverbial “orchestra,” it would help from time to time if the CPU had a way of subdividing that batch of cores and distributing them. Now it does. Intel calls this concept cluster on-die (COD), and for high-density computing applications such as the new wave of NFV communications functions, it could become invaluable.

“Not everybody finds an adequate way of sharing across 36 threads,” Intel’s Gianos admitted (remember, each Intel core is hyperthreaded, counting as two). “Not all application workloads need that level of cooperation… So we’re introducing a concept where we have a pair of clusters and a caching agent, and a set of cores that will operate as a smaller NUMA domain. It provides the opportunity for us to better segment this thing, and that’s the better choice for particular applications.”

Applications built using compilers that recognize Non-Uniform Memory Access (NUMA) can take advantage of how modern multicore processors can delegate memory spaces to specific cores, so that main memory becomes, in a sense, territorialized. (This is a concept that, frankly, AMD pioneered a decade ago, although Intel caught up and has surpassed AMD in recent years.) So an application compiled using NUMA principles can present to the processor a so-called “NUMA-optimized workload.”

At first glance, dividing cores into two clusters would seem to wreck NUMA’s territorial design. So Intel now incorporates two caching agents into the v3 die, enabling its 45 MB last-level cache (LLC) to be divided into two segments. Each segment is then assigned to a 9-core cluster (in the 18-core configuration). It’s not as easy as it seems: since there’s only one pool of main memory for both clusters, the two half-caches have to adopt a coherency protocol, so that both caches provide their respective clusters with correct snapshots of main memory contents. Think of what relational databases have to do with two or more concurrent users, and then speed that up by orders of magnitude.

For a two-socket server, a NUMA-aware application (such as a VM orchestration system) can indeed see four clusters of 9 cores each, each with its own cache segment kept coherent with the others. This way, NUMA-optimized workloads can live independently in their own clusters, and the memory controllers will steer each cluster toward memory addresses that are local to its own cores.
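With COD enabled, each cluster is presented to the operating system as its own NUMA node, so existing NUMA techniques carry over. As a hedged illustration (not the mechanism any particular orchestrator uses), the Linux sketch below reads the kernel’s node topology from sysfs and pins the current process to the CPUs of one node; under the kernel’s default first-touch policy, memory that process then allocates will tend to come from the same node.

```python
# Illustrative sketch: pin the current process to the CPUs of one NUMA node
# on Linux, using the sysfs node topology. With cluster-on-die enabled, each
# on-die cluster is exposed as its own node. Not a production scheduler.
import os

def node_cpus(node):
    """Parse /sys/devices/system/node/nodeN/cpulist, e.g. '0-8,36-44'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

target = node_cpus(0)               # bind to node 0's CPUs
os.sched_setaffinity(0, target)     # pid 0 means "this process"
print(f"Pinned to {len(target)} CPUs: {sorted(target)}")
```

In practice, an orchestration layer would more likely delegate this to numactl or libnuma, which can bind memory placement as well as CPU affinity.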

What does this mean?  Some compute-intensive, high-performance applications do not scale well with additional cores; in fact, adding cores can introduce latencies. Intel has observed that HPC applications tend to be developed for better NUMA optimization anyway, to improve performance as much as possible. They probably don’t need to be redeveloped. So cluster on-die takes advantage of how these applications are most likely designed, driving up processor utilization by reducing the number of resources they have to manage, while delegating the rest to other processes. Results of tests conducted in recent weeks by Intel show Xeon E5 v3 with between 20 and 30% performance gains for high-performance workloads over v2, attributed in large part to COD.

“Northbound” telemetry
In the realm of software-defined networking, there’s a term called northbound, referring to how APIs expose information from network components to management software. In SDN diagrams hauled out during long presentations to telecom engineers, the direction of that exposure is typically upward. Now that upward flow is reaching the server’s most important physical component, making it play a role in a software-defined infrastructure.

“From our perspective, the point is, what can we do at the lowest levels of the platform to expose that information northbound, so that we can optimize the way services can provision [the CPU]?” asked Dylan Larson (below), Intel’s director of datacenter group product lines.  “And be able to provide higher levels of assurance, of security, of awareness so that you can manage the way the workload is provisioned?  I’ve been excited about this idea of what it takes to bring low-level instrumentation, and project it northbound into these higher levels of orchestration software.”

Back in May, Intel launched a software product called Service Assurance Administrator (SAA) that enables admins to provision multi-tenant environments that orchestrators such as vSphere and OpenStack can easily pick up and run with. One of SAA’s objectives is to enable a “northbound” orchestrator, to borrow that phrase again, to ensure that servers are running applications and services at specified service levels, and are in compliance with IT policy. With Xeon E5 v3, Intel is promising to open up a wealth of new telemetry to SAA and other services.

“Putting in place new telemetry controls into the microprocessor enables us to try to make compute more efficient,” said Intel’s McConnell, “to remove the bottlenecks of people having to manually provision some of the architecture of the infrastructure.”
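Intel hasn’t detailed the new counters here, but the general pattern (low-level platform telemetry surfaced “northbound” to orchestration software) can be sketched with facilities that already exist. The example below, an assumption-laden illustration rather than anything to do with SAA, reads Linux’s RAPL powercap interface, which exposes cumulative package energy counters on recent Intel CPUs, and serves the readings as JSON over HTTP for a hypothetical management layer to poll.

```python
# Hedged sketch: surface per-package energy counters "northbound" as JSON.
# Reads Linux's RAPL powercap interface (available on recent Intel CPUs);
# the HTTP endpoint and layout are illustrative assumptions, not Intel SAA.
import glob
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_energy_uj():
    """Return {rapl_zone_name: cumulative_energy_in_microjoules}."""
    readings = {}
    for zone in glob.glob("/sys/class/powercap/intel-rapl:*"):
        try:
            with open(f"{zone}/name") as f:
                name = f.read().strip()        # e.g. "package-0"
            with open(f"{zone}/energy_uj") as f:
                readings[name] = int(f.read().strip())
        except OSError:
            continue                           # zone unreadable without privileges
    return readings

class TelemetryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(read_energy_uj()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TelemetryHandler).serve_forever()
```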

Who benefits and when?
Prior to Intel’s official announcement of the new features and enhancements in the Xeon E5-2600 v3 series, HP, followed by Dell, and then Cisco, announced completely new or substantively revised server product lines based on v3 processors. The parts of their announcements that had to wait until today boil down to two major points. First, vendor-branded Xeon E5 v3-based servers will include vendor-branded provisioning tools that let them be instantly provisioned “out-of-the-box,” deployed, and then managed remotely.

This provisioning will extend deeper than before, is much more role-based, and is more cognizant of the organizational role x86 servers can play in cloud-based datacenter environments. For example, some of Cisco’s new v3-based server blades are designed for the first time with processor and local memory only, so that Cisco’s network fabric may substitute for the usual PCIe service bus. This way, individual CPUs or CPU pairs can skip local storage and local I/O (features that are more useful for desktop PCs anyway) and let these provisioning tools pool storage and resources from the hybrid cloud.

Second, new classes of E5 v3-based servers will be built for SDN and communications-specific workloads only, utilizing Intel’s highly anticipated 40-gigabit XL710 Ethernet controllers, the so-called “Fortville” platform. Servers with Fortville will include Intel’s new Ethernet Flow Director technology, which improves the performance of servers operating as virtual switches, and also exposes APIs to SDN and NFV components, such as virtual firewalls. This will give certain vendor-branded servers clear performance edges over “bare metal,” a term that pretty much describes the type of platform OpenFlow and OpenDaylight have preferred, at least until now.

Clearly, Intel’s intent is to give partner vendors the tools and resources they need to distinguish their products in emerging markets, such as SDN and hybrid cloud, to a fuller extent than they’ve been able to before. Ever since the x86 server era began, the word on the street has been that a server is a server is a server, especially when they all use the same CPU. Intel wants the inverse of that assertion, for a world where one brand of Xeon E5 v3-based server is completely different in form and function from all the others… because they use the same CPU.