Google, Nvidia split top marks in AI benchmark.
Update Time: 2022-07-06 16:35:17
Training and inference are the two key workloads in AI/ML, and vendors of both general-purpose GPUs and dedicated training/inference accelerators want to post excellent scores on popular models and machine learning libraries to showcase their hardware. The industry needs a unified benchmarking standard, and MLPerf, established jointly by major vendors in 2018 around industry metrics, has taken on that role.
Over time, MLPerf has become an almost exclusive showcase for Nvidia, the GPU maker whose products rule the entire AI hardware market.
MLCommons, the industry consortium that oversees MLPerf, a popular machine learning performance test, released its latest results on Wednesday. There are some surprises.
Google received top marks
The most striking result of this round is Google's TPU v4 system. With this architecture, Google broke performance records in all five benchmark tests, with an average training speed about 1.42 times faster than the second-place Nvidia A100 system and a 1.5 times improvement over its own results in the 1.0 round.
Google's TPU v4 Pod comprises 4096 chips with a bandwidth of 6 Tbps. Google also has a wealth of production experience: compared with other companies, it is the only one that runs large-scale, widely used AI/ML applications in search and video.
TPU v4 vs A100 / Google
However, Google is not competing with Nvidia directly. Its real point of comparison is cloud service providers that use Nvidia GPU systems, such as Microsoft Azure, and Google has produced a cost comparison for this purpose. As shown above, when training the BERT model on 4096 TPU v4 chips versus 4096 A100 chips on Azure, Google's solution saves 35%, and when training the ResNet model it saves nearly 50%.
Even so, across all eight tests the results amount to a tie with Nvidia, and they may vary further with system size. Furthermore, Google's TPU is limited to its own cloud services, so it is not a universal solution. At the very least, it is not available on competing clouds such as Microsoft's and Amazon's.
Google opted to compete in only half the benchmarks, while Nvidia and its partners once again dominated results across all benchmark tests.
The version 2.0 round of MLPerf training results showed Google taking the top scores, measured as the shortest time to train a neural network, on four tasks for commercially available systems: image recognition, object detection (one test for small images and one for large), and the BERT natural language processing model.
Nvidia took the top honours for the other four of the eight tests for its commercially available systems: image segmentation, speech recognition, recommendation systems, and solving the reinforcement learning task of playing Go on the "mini Go" dataset.
Is Nvidia about to be overtaken?
In addition to Google, Gaudi2, the training accelerator from Intel's Habana Labs, also achieved good results. This processor, launched in May this year, moved from the previous generation's 16nm process to TSMC's 7nm, tripling the number of Tensor processor cores and delivering a 3x improvement in ResNet-50 training throughput and a 4.7x improvement in BERT training throughput.
Gaudi2 achieved a 36% reduction in training time on ResNet-50 compared to the results submitted by NVIDIA for the A100-80GB GPU system and a 45% reduction in training time on BERT compared to the results submitted by Dell for the A100-40GB GPU system.
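The article quotes some results as speedups (Google's 1.42 times) and others as reductions in training time (Gaudi2's 36% and 45%), which makes them hard to compare at a glance. The short Python sketch below converts between the two. It assumes that "N times faster" means the ratio of training times, which the article does not state explicitly, and the function names are illustrative only. Under that assumption, a 36% reduction corresponds to roughly 1.56x and a 45% reduction to roughly 1.82x, while a 1.42x speedup corresponds to roughly 30% less training time.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional cut in training time (0.36 for 36%) into a speedup factor.

    Assumption: speedup is defined as old_time / new_time.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


def time_reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor (1.42 for 1.42x) into a fractional cut in training time."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Figures quoted in the article; the labels are shorthand, not official names.
    reductions = {
        "Gaudi2 vs A100-80GB (ResNet-50)": 0.36,
        "Gaudi2 vs A100-40GB (BERT)": 0.45,
    }
    for label, reduction in reductions.items():
        factor = speedup_from_time_reduction(reduction)
        print(f"{label}: {reduction:.0%} less time is about {factor:.2f}x")

    tpu_cut = time_reduction_from_speedup(1.42)
    print(f"TPU v4 vs A100 (average): 1.42x is about {tpu_cut:.0%} less time")
```

These conversions only restate the published figures. They say nothing about system size, cost, or software configuration, all of which differ between submissions.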
The results show that several vendors' AI hardware can now match or even exceed NVIDIA's GPU ecosystem in training, but this does not represent the full machine learning training domain. For example, vendors are not required to submit results for every test. Seen this way, RetinaNet lightweight object detection, COCO heavyweight object detection, LibriSpeech speech recognition, and MiniGo reinforcement learning are tests where only NVIDIA GPU-based systems submitted results.
This suggests that Nvidia GPUs are still the most competitive option in the eyes of vendors such as Baidu, Dell, H3C, Inspur and Lenovo.
Software is also important
Another point worth noting concerns the results of the closed group, which used standard machine learning libraries such as TensorFlow 2.8.0 and PyTorch 22.04. In this group, Samsung and Graphcore submitted results based on different software configurations, but the most impressive entry came from MosaicML.
Training time comparison of Composer under ResNet-50 / MosaicML
The accelerator hardware this company used is the same Nvidia A100-SXM-80GB GPU as many other submitters. However, it used its own library, Composer, written in PyTorch, which launched in April this year and claims to make model training two to four times faster. In the MLPerf Training 2.0 round, the run using MosaicML Composer achieved a nearly 4.6x improvement in ResNet training speed over its comparison group. Although Composer supports any model, the speedup is currently most evident on ResNet, so results for other models were not submitted this time.
Intel and other companies are already acquiring software firms such as Codeplay to strengthen their software development. MosaicML, a recently unveiled startup whose founder was a key member of Intel's AI lab, may also attract the attention of companies like NVIDIA if it can show better results in the future.
NVIDIA has dominated MLPerf for years, and many people regard the benchmark as a propaganda tool for NVIDIA. In fact, Intel, Google and other companies that value AI also see it as a fair benchmark, and MLPerf includes a peer review stage to further validate the results. As the results above show, innovation in AI training hardware continues unabated, with GPUs, TPUs and IPUs all pushing the envelope. Still, benchmark results do not mean that every use case will achieve high performance, and vendors need to tune their models and software to get the best results.