In June, Intel let the world know its latest Xeon Phi chip was ready for release, proudly announcing a product that aims to hold its own against Nvidia’s GPUs.
In a press call, the company talked through a series of slides suggesting the Phi boasted better performance than a dedicated GPU at every turn. Knights Landing had landed, it was impressive, and Intel had the benchmarks to prove it.
Well, sort of.
Off the benchmark
It’s not unusual for a company to be a bit economical with the truth, especially when the world of benchmarks is largely self-policed. But there is still an expectation of some logic behind the choice of products a benchmark is tested against.
So when Intel chose a plethora of different devices to compare against different aspects of its chip, it was hard not to wonder whether the company had cherry-picked the ones that fit its agenda.
For example, Intel said that the Xeon Phi offers up to 38 percent better scaling than GPUs, after testing its product against four-year-old Tesla K20X chips used as part of the Titan supercomputer.
Elsewhere, Intel claimed the Phi is up to 2.3x faster in training than GPUs, based on 18-month-old Caffe AlexNet data for Maxwell GPUs. The same test run with current Caffe AlexNet software would have produced a much faster result for Maxwell.
Intel also said that the Phi can improve its performance by a factor of 50 when scaled to 128 nodes instead of one, claiming that comparable GPU scaling data simply did not exist. As it turns out, Chinese web services giant Baidu had previously published data showing similar scaling on up to 128 GPUs.
During the briefing call, DCD asked Intel: how does the company decide which GPUs are picked for the benchmark? Hugo Saleh, director of Marketing, HPC Platform Group, said: “It really comes down to the availability of the product, so we don’t necessarily have access to all the different or the latest things that our competition is putting out there when it is released. But what you’ll see is that we tried to build out the fairest comparison we possibly can.”
In this case, more recent products were surely available for the Phi comparison; and if they were not, it could be argued that the benchmark results were so out of date they were simply not worth announcing.
Over the course of two and a half months, through dozens of emails and several phone calls, DCD tried to get more information on Intel’s benchmarking decisions. Eventually, an Intel spokesperson said: “There is simply nothing more to say than the fact that Intel used the latest available public information for the comparisons made at the time – and we have shared the deck in full. Intel is not prepared to expand on that or discuss the issue further.”
Nvidia has made no secret of its disagreements with Intel, especially on the benchmarking front. In a recent blog post, the company hit back: “While we can correct each of their wrong claims, we think deep learning testing against old Kepler GPUs and outdated software versions are mistakes that are easily fixed in order to keep the industry up to date… They should get their facts straight.”
The controversy over the latest set of results echoes earlier complaints from AMD, which called out Intel for using the SYSmark benchmarking tool - a tool AMD said favored its rival’s products.
“The recent debacle over the emissions ratings provided by a major automaker provides the perfect illustration as to why the information provided by even the most established organizations can be misleading,” said AMD’s Commercial Client Business dev John Hampton, in reference to Volkswagen’s emissions scandal.
Now, of course, Intel’s particular logic for picking comparisons may not exactly rival the catastrophic and harmful VW impropriety, but damage is still done. Businesses need to be able to make purchasing decisions based on fact - something made rather difficult when misinformation is propagated.
And, sure enough, after the official announcement at the ISC 2016 High Performance Computing event in Frankfurt, the world began to hear that the Xeon Phi was giving Nvidia a run for its money, with numerous publications reiterating that it was 2.3x faster in training, and could scale in a way nothing else could.