It was 2004. Engineers at IBM, Intel, AMD, and Nvidia had all published their primary requirements for what they considered to be next-generation computing systems. For their processors to continue to appear to be scaling up in both power and performance, memory had to do its share.

Memory had to increase in density, perhaps by a factor of 6 or 8. It had to run at lower voltage and draw substantially less power. It needed nearly double the bandwidth, while at the same time opening up more channels to more modules. The processor manufacturers promised this much: If they could move the logic part of the process off separate controllers and onto the CPU, simpler DIMMs could concentrate on a shorter list of goals.

One of those items on the short list was conquering the laws of physics. The consequences of failure were summed up in a then-confidential semiconductor industry presentation by HP which, to illustrate its point, depicted a fatal explosion at both towers of the Three Mile Island generating station. Under the heading “Thermal Nightmares,” it described an untenable state of affairs in which the power necessary to sustain expanding memory bandwidth could no longer be maintained by CPUs, whose shrinking die sizes demanded lower power to avoid meltdowns.

The deadline for vanquishing these nightmares was 2010.

It’s fair to say the server industry avoided a meltdown. But while it managed to successfully “cheat death,” if you will, staving off catastrophe for another four years, it faced a new pressure that had been unforeseen in 2004: the new demands of cloud-driven data centers. Today, compute factories such as Amazon, Facebook, and Azure demand huge quantities of low-power processors with lower, not higher, core counts (because scaling up decreases determinism in processes), and very high memory bandwidth.

DDR4 was way overdue.

A late delivery

By most reliable estimates, Samsung holds a 40% share of the global DRAM market. Although Samsung portrays itself as an equal player in the development of the DDR4 standard, in reality, its role is about as equal as Russia’s in a chess tournament.

Samsung’s mass production of DDR4 began in August 2013. While some analysts and a chunk of the press portrayed Samsung as late, the truth is, the semiconductor manufacturers who set those lofty 2004 goals, amid the threat of nuclear annihilation, weren’t exactly ready either. Even today, the first wave of Intel server processors to support DDR4 will not support its highest available transfer rates; though DDR4-3200 is foreseeable, for now, Intel’s Grantley platform processors will be limited to DDR4-2133.

Amid all the promised bandwidth and performance improvements described with numbers and an “x” beside them, Intel admits that, for now, its maximum server memory bandwidth improvement from its second to its third generation of Xeon E5-2600 processors is 14%.

“We’ll get there eventually,” remarked one Intel engineer.

“DDR4 is my baby,” claims J. S. Choi, Jr., Samsung’s senior director for memory product planning. Choi was Samsung’s representative to JEDEC, the standards body responsible for maintaining the interoperability roadmap for the memory industry. It was Choi’s first opportunity to represent a major semiconductor standard to JEDEC, and he describes watching DDR4 take shape there over the past several years like a father anticipating the arrival of a new member of his family.

It was an uncertain arrival.

“One of the challenges we are facing now is, increasing speed means increasing the opportunity for a fail in the data bus,” stated Choi. It was the thermal nightmare, and when Choi took up the mantle for JEDEC, it was more of a likelihood than a possibility.

“Multiple companies presented their insight,” he continued, “for what would be their requirements for future memory generations. Based upon those proposals, we collected all the input from… multiple vendors and platform engineers. They prioritized what would be the most important things on the memory side for 2010, 2012 — power, latency, prefetch, performance. Everyone had a slightly different priority… What they came up with, in short, was, everyone wants to have higher-capacity memory, higher performance per watt, and higher reliability. But the industry wants to have a lower cost for memory as well.”

Despite what the many vendors told JEDEC they wanted from DDR4, Choi says, Samsung knew one principal fact about the computer industry — both as a whole, and among high-end servers in particular: The ecosystem is slow to change. Very gently — especially in the presence of Intel engineers — Choi made clear that even if Samsung were to suddenly produce a true solution that met everyone’s requirements, actually adopting that solution might not happen for several more years.

And still, there was one emerging performance requirement that didn’t show up on the vendors’ list.

“When we defined DDR3,” Samsung’s Choi remarked, “the I/O portion of memory operations was not that big — less than 10% of the bandwidth, or something like that.” This was in the era of DDR3-1600, whose maximum theoretical transfer rate was 1.6 GT/sec. The scaling out of servers, in the virtualization era, would require memory to be communicative in a way it never was before.

The laundry list

No single architectural advancement enabled Samsung to respond to everyone’s requests. You could say nothing about DDR4 is revolutionary in itself. There are some architectural breakthroughs, but there are just as many efforts to cheat the laws of physics.  (The ones we’re hearing about now are the ones that worked.)

3D Stacking

You can easily imagine how a 3D printer produces an object by depositing hot plastic onto a surface in layers. Though many memory producers have talked about the concept of layering dies in the same way, Samsung is the first to bring this technique to production using a principle that sounds more complex than it is. Called through-silicon via (TSV), it’s essentially a set of matching holes etched through a master die and a slave die, with electrodes passing through them to link the two. The benefit here is increasing the density of memory packages from 64 to 128 GB without crowding the dies together like airplane passengers.

It’s not the first time dies have been stacked, notes Choi, though the typical manner is monolithic, akin to folding a wafer in half and letting wire bonds connect one edge. TSV, he says, will be the method of choice going forward for all manufacturers, whether they want to continue the monolithic method or slice the design into two dies as Samsung does now.

The payoff comes in the form of vastly increased internal bandwidth, due to the proximity of components that previously had to take serpentine paths to reach one another. Theoretical maximum bandwidths increase from 12.8 GBps on DDR3 to 256 GBps on DDR4 (those are upper-case “B’s”).
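
The DDR3 side of that comparison is easy to reproduce: peak bandwidth for a conventional module is just the transfer rate multiplied by the width of the 64-bit data bus. A quick sketch of that arithmetic (the 256 GBps figure quoted above for stacked DDR4 does not fall out of this simple per-channel formula):

```python
# Back-of-envelope check of the DDR3 figure cited above: peak bandwidth for a
# conventional 64-bit memory channel is simply transfer rate x bus width.
# (The 256 GBps figure quoted for TSV-stacked DDR4 is not derived from this
# per-channel formula.)

def peak_bandwidth_gbps(transfers_per_second: float, bus_width_bits: int = 64) -> float:
    """Peak bandwidth in gigabytes per second (upper-case B)."""
    return transfers_per_second * bus_width_bits / 8 / 1e9

print(peak_bandwidth_gbps(1.6e9))    # DDR3-1600: 12.8 GB/s per channel
print(peak_bandwidth_gbps(2.133e9))  # DDR4-2133: ~17.1 GB/s per channel
print(peak_bandwidth_gbps(3.2e9))    # DDR4-3200: 25.6 GB/s per channel
```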

Bank interleaving

Prefetching is what all memory chips (and CPUs, for that matter) do to make data available to the system before it’s needed, even if it ends up not being needed. As Choi told me, whenever engineers increase the capacity of a memory component, that increases the overhead necessary to accomplish a prefetch operation. A DDR3 component has 8 banks; DDR4 doubles that number to 16, organized into bank groups.

DDR4 overcomes the overhead problem, in the narrowest 4-bit (“x4”) configuration, by interleaving these banks in groups. You see, engineers found that the minimum time required between accesses to different bank groups is shorter than the time between successive accesses to the same bank group.

“In DDR4, we kept the same number of prefetches,” said Choi, “and using the bank row, we increased the performance.” As the width grows to 8 bits (“x8”) and then to 16, the bank count comes down, giving engineers a choice between trade-offs.
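
A rough way to picture the benefit of interleaving, with assumed cycle counts standing in for the real JEDEC timing parameters (DDR4 calls the two gaps tCCD_L and tCCD_S): back-to-back accesses that stay inside one bank group must wait out the longer gap, while accesses that hop between groups only need the shorter one, so an interleaved access pattern finishes sooner.

```python
# Illustrative only: the cycle counts below are assumptions, not JEDEC values.
# DDR4 requires a longer minimum gap between back-to-back accesses to the SAME
# bank group (tCCD_L) than between accesses to DIFFERENT bank groups (tCCD_S),
# so spreading a burst of accesses across groups finishes sooner.

T_CCD_S = 4  # assumed minimum gap, different bank groups (cycles)
T_CCD_L = 6  # assumed minimum gap, same bank group (cycles)

def issue_cycles(bank_groups):
    """Cycles needed to issue a sequence of accesses, given each one's bank group."""
    cycles = 0
    for prev, curr in zip(bank_groups, bank_groups[1:]):
        cycles += T_CCD_L if prev == curr else T_CCD_S
    return cycles

same_group  = [0, 0, 0, 0, 0, 0, 0, 0]   # hammering a single bank group
interleaved = [0, 1, 2, 3, 0, 1, 2, 3]   # rotating across four bank groups

print("same group: ", issue_cycles(same_group))    # 42 cycles
print("interleaved:", issue_cycles(interleaved))   # 28 cycles
```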

Narrower page size

In the x4 configuration, DDR4 memory reduces the page size for memory access operations from 1 KB to 512 bytes. You’d think this would make memory busier, but in the end, engineers found they could cut activation power by about 10%.

“Whenever there’s activation of memory, always 1 KB by itself was activated,” explained Choi.  “Then the controller took just a few bytes out of the 1 KB. However, in DDR4, we reduced the number of activated cells in one activation command. If you look at the x4 DDR4 memory, which is very popular in server applications, it has half the page size. It can help to reduce this overall activation and operating power.”
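
Samsung did not walk through the arithmetic behind that 10% figure, but the shape of the argument can be sketched with assumed numbers: halving the page roughly halves the energy of each activation, and the overall saving then depends on how large a share of the device’s power budget activation represents.

```python
# A sketch of the page-size argument with assumed numbers, not Samsung's
# measurements: if the energy of an ACTIVATE command scales with the number of
# cells sensed (the page size), halving the page halves the per-activate
# energy. The overall saving then depends on activation's share of total
# device power; a ~20% share (assumed here) would yield roughly the 10% cited.

ddr3_page_bytes    = 1024
ddr4_x4_page_bytes = 512
activation_share   = 0.20   # assumed fraction of device power spent on activation

per_activate_saving = 1 - ddr4_x4_page_bytes / ddr3_page_bytes   # 50%
overall_saving      = activation_share * per_activate_saving     # ~10%

print(f"per-activate energy saving: {per_activate_saving:.0%}")
print(f"overall power saving:       {overall_saving:.0%}")
```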

VDDQ termination

This is one of those clever tricks that look so obvious once someone else shows them to you first.

“The power consumption of I/O operations is getting bigger,” said Choi.  “The challenge we have is how we reduce the I/O operation power [as a share of] overall system power. But that’s why we changed the topology.”

DDR2 and DDR3 circuits terminated their DC current pathways using a technique called center-tap termination, which was adequate at the time but meant a constant power draw regardless of whether the line was high or low. DDR4’s engineers swapped out center-tap termination for termination to VDDQ, a pseudo-open-drain approach.

“The theory of VDDQ is, if data is high, there is no DC current path,” explained Choi, meaning a DC current path exists only while the line is driven low. “Theoretically, you can save up to 50% of the power of I/O operations.”
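
Here is a minimal sketch of that argument, under one simplifying assumption: that a comparable DC current flows whenever a termination path exists at all. The point is simply how often each scheme leaves that path open.

```python
# A minimal sketch, under one simplifying assumption: a comparable DC current
# flows whenever a termination path exists. With center-tap termination
# (DDR2/DDR3) the bus is terminated to VDDQ/2, so a path exists whether the
# line is high or low; with VDDQ (pseudo-open-drain) termination the path
# exists only while the line is driven low.

import random

random.seed(1)
bits = [random.randint(0, 1) for _ in range(100_000)]   # random data pattern

ctt_path = len(bits)                        # center-tap: DC path in both states
pod_path = sum(1 for b in bits if b == 0)   # VDDQ termination: path only on lows

print(f"center-tap: DC path open {ctt_path / len(bits):.0%} of the time")
print(f"VDDQ:       DC path open {pod_path / len(bits):.0%} of the time")
# With random data (about half lows), that halving is where the quoted
# "up to 50%" I/O power saving comes from.
```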

Command Address Latency

It may seem counter-intuitive that one could increase performance by adding programmed latency cycles into a system. But to the extent that performance is driven by available power, DDR4 engineers were able to reduce power consumption overall by placing the memory unit’s command and address receivers in a low-power state until they’re actually needed. The added Command Address Latency (CAL) gives these receivers time to wake up, but they don’t have to stay on all the time.
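
A rough illustration of the idea, with assumed cycle counts rather than JEDEC values: if chip select, asserted CAL cycles early, serves as the wake-up signal, the command and address receivers only need to be powered for a short window around each command instead of every cycle.

```python
# Illustrative sketch of Command Address Latency; cycle counts are assumptions,
# not JEDEC values. With CAL, chip select arrives CAL cycles before the
# command, so the command/address receivers can sleep until it asserts and use
# that window to wake up.

CAL     = 4   # assumed wake-up window, in clock cycles
CAPTURE = 1   # cycles to latch the command/address once awake

command_cycles = [10, 40, 41, 90, 200, 201, 202]   # cycles when commands arrive
total_cycles   = 250

# Without CAL, the receivers must stay powered every cycle, because a command
# could arrive at any time.
on_without_cal = total_cycles

# With CAL, the receivers are powered only from chip-select assertion
# (CAL cycles early) through command capture.
on_cycles = set()
for c in command_cycles:
    on_cycles.update(range(c - CAL, c + CAPTURE))
on_with_cal = len(on_cycles)

print(f"receiver-on cycles without CAL: {on_without_cal}")   # 250
print(f"receiver-on cycles with CAL:    {on_with_cal}")      # 23
```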

The tale of the tape

Server manufacturers told us to expect breakthrough performance from the first wave of Intel Xeon E5-2600 v3-based servers supporting DDR4. The verdicts on whether such performance is actually materializing have only been recorded in the past few weeks.

In benchmark test results sent to the Standard Performance Evaluation Corporation (SPEC) in recent weeks, Dell’s best-performing server is a PowerEdge R730, based on a pair of Intel’s 8-core Xeon E5-2667 v3 processors. This is Intel’s “frequency optimized” SKU, clocked at 3.2 GHz, for server builders looking for the highest performance in the least space, though perhaps at a power cost. The R730 achieved a CINT2006 base score of 62.3 and a peak score of 65.7 (meaning it’s capable of 6,570% of the peak performance of the 1997-vintage Sun Microsystems UltraSPARC reference server running the same test). HP’s ProLiant DL380 Gen9, configured with the same number of 2667 v3 chips, scored 63.1 and 66.1.

Compare these numbers against their predecessors with Xeon v2 processors and DDR3 memory. Recent scores for a Dell PowerEdge R720 with a pair of Xeon E5-2667 v2 processors, clocked at 3.3 GHz, were 63.0 for base performance and 67.7 for peak. On HP’s side of the fence, its ProLiant DL360p Gen8 equipped with 2667 v2 chips scored a base of 62.7 and a peak of 67.9.

So at least on the high-frequency side of the performance scale, if you believe DDR4 helped Intel make any significant performance gains for Xeon-based servers, you’d also have to believe the new v3 CPUs squelched them. The scores simply don’t show a generational leap.

Are these flat performance numbers essentially the case across the board? Let’s look at SPECint scores for servers using more ordinary processors. The Xeon E5-2620 v3 is the bargain chip in Intel’s “standard” performance tier. In a recent SPEC test of an HP ProLiant BL460c Gen9 server, with a pair of 6-core 2620 v3 chips clocked at 2.4 GHz and DDR4 memory, the server scored base performance of 53.8 and peak performance of 56.2. An earlier ProLiant BL460c Gen8 with 6-core 2620 v2 chips clocked at 2.1 GHz (a little lower, mind you) posted a base performance score of 40.6 and a peak score of 43.0. Now things are looking up: roughly a 31% performance gain for the v3, handily beating the 14% clock speed boost.

Similarly for Dell servers, a PowerEdge R730 with a pair of 2620 v3 chips clocked at 2.4 GHz recently posted a base performance score of 53.1 and a peak score of 56.3 (again, very consistent with its HP counterpart). Compare these scores against those for Dell’s previous-generation PowerEdge R720, with two 2620 v2 chips clocked at 2.1 GHz: base score of 40.6, peak score of 42.9. The same performance surge, again outpacing the clock speed boost.
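
For anyone who wants to check the arithmetic, those percentages fall out of the published peak scores and clock speeds quoted above (using base scores shifts the results by a point or two):

```python
# Reproducing the percentages above from the peak scores and clock speeds
# quoted in this article.

def pct_gain(new, old):
    return (new / old - 1) * 100

print(f"HP BL460c Gen9 over Gen8, peak score: {pct_gain(56.2, 43.0):.0f}%")   # ~31%
print(f"Dell R730 over R720, peak score:      {pct_gain(56.3, 42.9):.0f}%")   # ~31%
print(f"Clock speed, 2.4 GHz over 2.1 GHz:    {pct_gain(2.4, 2.1):.0f}%")     # ~14%
```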

Subtract the 14% clock speed boost from the roughly 31% overall gain, and about 17% remains in both HP’s and Dell’s basic servers moving from the DDR3 to the DDR4 generation. That remainder is almost exactly in line with Intel’s stated 14% memory bandwidth improvement, meaning nearly the entire surge beyond clock scaling could theoretically be credited to DDR4. So why didn’t the higher-end processors see similar gains?

One feasible theory is that the latencies DDR4 adds to the memory access process — some of which are programmed and, as you’ve seen, intentional — compound themselves at higher frequencies. But another equally plausible theory blames the processor rather than the memory. Intel introduced new techniques in its v3 series of Xeon processors to mitigate the challenges of servers scaling out, and SPEC acknowledges that only one of the test batteries in its CINT2006 suite is well suited to massive parallelization.

For now, the test results show us that DDR4 is at least partially, and perhaps wholly, responsible for performance gains in low-end and mid-tier servers, while at best mitigating what may be performance drop-offs on the high end. This is the early stage of this new technology’s adoption. As JEDEC’s members, including Samsung, were well aware from the beginning, the adoption phase takes more than just a few months. New compilers that enable the latest wave of processors to make better use of Xeon’s and Xeon Phi’s vastly updated memory controllers may yet yield the benefits expected for the high end.

The key benefit, however, is enabling much lower-power systems in data centers whose compute power per cubic foot will only grow denser. Denser power leads to heat, and heat leads to meltdowns. For today, the thermal nightmare has been averted. For today.