ARM released its first generation of machine learning processors
Published time: 2019-12-20 11:21:39
ARM has released its first generation of processors for AI and machine learning. The architecture, called "Trillium", absorbs the most successful innovations in hardware, data compression, and compilers.
ARM officials said the processor abandons the cache, combining capabilities like those of NVIDIA's Tensor Cores with the programmability of FPGAs and the low-power processing of DSPs.
In the past few years, several chip startups have sought new ways to train and execute neural networks efficiently. But given the existing technologies and ideas, is it really necessary to start from scratch?
This week, at the annual Hot Chips conference, ARM presented its first-generation machine learning processor, which is expected to be available to ARM partners later this year. The processor architecture, called "Trillium", bundles familiar elements with ARM's own logic cores.
For those who need functionality like the Tensor Cores of NVIDIA's Volta GPUs, the ARM processor may be significant: it combines neural-network compression techniques reminiscent of DeePhi's (now part of Xilinx), FPGA-style programmability, and the low-power processing of DSPs.
In other words, ARM positions the processor as a best-of-breed AI design, which could cause trouble for chip makers who devote large amounts of extra die area to general-purpose devices.
ARM's technical director Ian Bratt said that the design goal of ARM's first foray into AI processors is broad applicability: meeting the market demand for server-side AI while also bringing its AI processors to cars and to small devices with IoT requirements, as he explained at Hot Chips this week.
“In the process of developing the first generation of machine learning processors, we made some mistakes early on by applying old frameworks to new problems. We knew how GPUs, CPUs, and DSPs are used for machine learning, but we began to study how to make deliberate use of each technology. We can use CPU techniques to handle control and programmability, and GPU techniques to address data compression, data movement, and computational density, which improves DSP-style efficiency and supports open-source software development.”
As shown in the figure below, ARM's machine learning architecture is nothing exotic, but it is worth noting that it absorbs the benefits of the most successful innovations in hardware, compression, and compilers.
The building block of the architecture is the compute engine, each with a 64 KB SRAM slice, for a total of 16 engines. The MAC engine (unlike NVIDIA's Tensor Core) is where convolutions are performed, and the programmable layer engine handles most of the necessary data shuffling between the layers of the network. The architecture also has a DMA engine for communicating with the external memory interface.
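The layout just described can be summarized in a small, hypothetical model; the engine count and SRAM slice size come from the article, while the class names are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class ComputeEngine:
    """One compute engine: a 64 KB SRAM slice plus (not modeled here)
    a MAC engine and a programmable layer engine."""
    sram_bytes: int = 64 * 1024  # 64 KB local SRAM slice

# 16 compute engines, as described in the article
engines = [ComputeEngine() for _ in range(16)]
total_sram_kb = sum(e.sram_bytes for e in engines) // 1024
print(total_sram_kb)  # 1024 KB (1 MB) of distributed on-chip SRAM
```

The point of the sketch is simply that storage is distributed per engine rather than pooled behind a shared cache.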
No need for caching, and control flow is greatly simplified
For a company built on innovation, ARM is taking its own path here: this is its first venture into artificial intelligence chips.
ARM has made some key innovations in the dot-product engine for neural networks, improving execution efficiency and reducing network noise.
One element that is easily overlooked is the value of static scheduling, which is a key part of the chip's overall performance and efficiency.
Memory access patterns are completely statically analyzable and easy to understand and map, yet many devices do not take advantage of this.
The CPU has a complex cache hierarchy used to optimize non-deterministic memory access, but for deterministic neural networks, everything can be placed in memory in advance.
The compiler then generates a command stream for the different components (dispatched by the ARM control processor), which arrives at the registers that control those components.
In short, no cache is needed. Another benefit is that flow control is greatly simplified, which further reduces energy consumption and improves the predictability of processor performance.
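The idea of a compiler-generated command stream can be sketched as follows. This is a hypothetical model, not ARM's actual command format: because the network's memory accesses are deterministic, every transfer and compute step can be emitted at compile time and executed in order, with no cache or dynamic reordering.

```python
# Hypothetical statically scheduled command stream: each entry is an
# operation for one component (DMA engine, MAC engine, layer engine),
# fixed at compile time by the compiler.
commands = [
    ("DMA_LOAD",  {"src": "DRAM", "dst": "SRAM"}),  # stage inputs in SRAM
    ("MAC_CONV",  {"layer": 0}),                    # convolution on MAC engine
    ("PLE_OP",    {"layer": 0}),                    # layer engine post-processing
    ("DMA_STORE", {"src": "SRAM", "dst": "DRAM"}),  # write results back
]

def run(stream):
    """Execute commands strictly in compile-time order (no cache, no reordering)."""
    return [op for op, _args in stream]

print(run(commands))
```

Since the order is fixed ahead of time, execution latency is fully predictable, which is the simplification the article describes.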
The way convolution is handled can further increase efficiency. The figure below highlights how the compiler allocates SRAM resources for input feature maps and the compressed model.
Different feature maps are distributed across the compute engines, so each engine works on its own portion.
ARM's MAC engine can perform eight 16 × 16 dot products. We have already discussed the importance of this, but many of these operations contain zeros, which can be detected in the MAC engine and skipped to avoid wasting energy.
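The zero-detection idea can be illustrated with a minimal sketch. The real MAC engine gates these operations in hardware; this function merely shows the principle that a multiply whose operand is zero contributes nothing and can be skipped:

```python
def zero_skipping_dot(weights, activations):
    """Dot product that skips multiplies when either operand is zero.

    Illustrative only: returns the dot product and the number of
    multiplies that were skipped (i.e., energy-saving opportunities).
    """
    total = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            skipped += 1       # no multiply issued
        else:
            total += w * a
    return total, skipped

result, skipped = zero_skipping_dot([0, 2, 0, 3], [5, 0, 7, 1])
print(result, skipped)  # 3 3 — three of four multiplies were avoidable
```

Neural-network weights and activations are often highly sparse (ReLU outputs in particular), which is why this simple check pays off.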
The ARM chip also has a programmable layer engine, designed to future-proof the processor through programmability.
It uses Cortex CPU technology to support non-convolution operators, as well as vector and neural network extensions.
Higher efficiency can be achieved with the machine learning processor's feature-map compression techniques, which sound similar to DeePhi's CNN compression.
Bratt said that ARM's machine learning business unit currently has 150 employees. As demand for machine learning grows, this number will continue to increase, and machine learning will be integrated into new and existing workflows and configurations.
He said that their goal is to make this work across a range of market segments, but it is not easy to provide a common platform that gives every class of user all the features they need.
Streamlined compression, mixed-precision arithmetic, on-chip SRAM computing, and dense dot-product engines together make ARM's chip IP a compelling offering on the market.
They can be further refined for critical workloads.
Compared with some dedicated AI processors, the addition of high-bandwidth memory to ARM's processors may make them easier to distinguish, but it requires licensees to understand how these components work together. ARM's engineers have drawn on the best AI processor technology in the ecosystem and hooked it up with open-source software, potentially expanding the scope of licensing.
The figure above shows an 8×8 block from Inception V3, highlighting the lossless compression achieved by the zero/non-zero filtering method, which significantly reduces the size of the neural network. The compressed results are retained in internal SRAM, and the pruned network is kept in SRAM for use when needed.
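A zero/non-zero filtering scheme of the kind described above can be sketched as follows. This is an assumption about the general technique, not ARM's actual encoding: a bitmask records which entries are non-zero, and only the non-zero values are stored, so sparse blocks compress well and the round trip is lossless.

```python
def compress(values):
    """Lossless zero/non-zero compression (illustrative sketch):
    store a 0/1 mask plus only the non-zero values."""
    mask = [1 if v != 0 else 0 for v in values]
    nonzero = [v for v in values if v != 0]
    return mask, nonzero

def decompress(mask, nonzero):
    """Rebuild the original block from the mask and stored values."""
    it = iter(nonzero)
    return [next(it) if m else 0 for m in mask]

fmap = [0, 0, 5, 0, 3, 0, 0, 1]          # a sparse feature-map fragment
mask, nz = compress(fmap)
assert decompress(mask, nz) == fmap       # round trip is lossless
print(len(nz), "values stored instead of", len(fmap))
```

The sparser the block, the fewer values need to be held in SRAM, which is why the 8×8 Inception V3 blocks in the figure compress so well.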
There are not many options for licensing such technologies, and ARM must also determine which components of existing neural-network processors are the most successful and worthwhile.