From the humble smartphone to the most powerful supercomputer, neural processors will soon be everywhere. They will improve the performance/power consumption ratio of Deep Learning, both during the training phase on servers and during the inference phase on all types of devices.

While the uses of Machine Learning and Deep Learning are becoming widespread, a new generation of processors is on the horizon: NPUs, or Neural Processing Units.

Today, to create a neural network that will later identify the behavior of visitors to an e-commerce site, or analyze the operation of industrial equipment as part of a predictive maintenance program, the data scientist uses conventional servers to run his or her frameworks.

However, an x86 processor, even the Xeon chips of a typical enterprise server, is not optimized for such calculations. Simulating the behavior of virtual neurons does not require the floating-point precision of modern chips; 16-bit arithmetic is often enough.
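As a rough illustration of that point, here is a minimal sketch (NumPy and the layer size below are assumptions for the example, not something the article specifies) comparing the same dense-layer computation in 32-bit and 16-bit arithmetic: the weights take half the memory, at a precision cost that is usually acceptable for neural workloads.

```python
import numpy as np

# Simulate one dense layer, y = W @ x, first in FP32 and then in FP16.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)

y32 = W @ x                                                    # full-precision reference
y16 = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

print("weights:", W.nbytes // 1024, "KiB in FP32 vs",
      W.astype(np.float16).nbytes // 1024, "KiB in FP16")
print("max absolute difference between the two results:", np.max(np.abs(y32 - y16)))
```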

The important thing is to maximize the speed of data exchange between a very large number of compute nodes, with the lowest possible latency.

As a result, developers are increasingly deploying Machine Learning and Deep Learning frameworks on GPUs (graphics accelerators). With their very large number of cores, each simpler than the 64-bit x86 cores of conventional processors, GPUs offer a much better performance/power consumption ratio.

With its Deep Learning SDK, NVidia already supports all the current AI software frameworks: Caffe, created by Berkeley researchers, Caffe2, Google’s TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano.
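As a minimal sketch of what this looks like from the framework side (assuming a recent TensorFlow 2.x install, an API that postdates the period described here), the framework detects the GPU and places the heavy matrix operations on it:

```python
import tensorflow as tf

# List the accelerators TensorFlow can see; on a machine with an NVidia GPU
# and the CUDA libraries installed, the GPU appears here alongside the CPU.
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

if gpus:
    # Place a large matrix multiplication explicitly on the first GPU;
    # the work is spread across its many simple cores.
    with tf.device("/GPU:0"):
        a = tf.random.normal((2048, 2048))
        b = tf.random.normal((2048, 2048))
        c = tf.matmul(a, b)
    print("result computed on:", c.device)
```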

GPU makers adapt their chips for AI

GPUs are currently the easiest platform on which to implement neural networks. But it is possible to do better with cores and processor architectures actually designed for AI. This is the approach followed by NVidia and AMD, which have each unveiled new architectures dedicated to Deep Learning.

The former offers Volta, a chip combining 640 specialized cores called Tensor Cores with 5,120 CUDA cores, the more traditional cores of an NVidia architecture.

According to its designers, this architecture delivers more than 100 TFLOPS. Designed to train neural networks, the chip equips the V100, an accelerator card aimed primarily at data centers and available in PCIe format. It also comes in SXM2, a proprietary card format that should allow system designers to extract up to 125 TFLOPS from a single card.
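In practice, Tensor Cores are exercised through mixed-precision training. Here is a minimal sketch, assuming PyTorch and its torch.cuda.amp module (neither is mentioned in the article) and a CUDA-capable GPU, of a training step whose FP16 matrix multiplications are eligible to run on the Tensor Cores:

```python
import torch

# Assumes a CUDA-capable GPU; on Volta-class hardware the FP16 matrix
# multiplications inside autocast() can be executed by the Tensor Cores.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # rescales the loss to avoid FP16 underflow

data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # casts eligible ops to FP16
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```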

AMD, NVidia’s great rival, is not left out: the American company offers Radeon Instinct, a range of GPUs designed to accelerate Deep Learning workloads. It provides developers with an optimized version of MIOpen, its open-source library for Deep Learning acceleration, as well as the ROCm software layer.

The latter makes it possible to use Radeon Instinct cards to accelerate the Caffe, TensorFlow and Torch frameworks, the last of which is backed by Facebook.

IBM, the pioneer of neural chips

NVidia, like AMD, has chosen fast HBM2 memory to boost the performance of its cards, but these remain essentially evolutions of their GPUs. IBM Research is undoubtedly the first of the major players in computing to have imagined designing specific electronic components to run neural networks.

Funded by DARPA, the SyNAPSE research project was launched in 2008 and, after several stages of development, culminated in 2014 with the TrueNorth chip. Still experimental, it is capable of simulating 16 million neurons and 4 billion synapses while consuming only 2.5 W.

IBM has undoubtedly achieved significant results in terms of performance/consumption ratio, but distribution of its chip remains very limited. Only the Lawrence Livermore National Laboratory, the US nuclear research lab, officially has the chip, where it helps researchers analyze the results of their simulations.

IBM is far from the only one working on chips designed to boost Deep Learning performance. Google sprang a surprise by unveiling its TPU (Tensor Processing Unit) chip at its Google I/O developer conference in 2016. Sundar Pichai announced that the chip had a performance/consumption ratio 13 times higher than that of the best GPUs of the moment.

Microsoft prefers to bet on FPGA components

It will be several months before this chip reaches the market in the form of an accelerator card. In the meantime, some are turning to a solution that is immediately available in volume and more efficient than GPUs at a specific task: FPGAs.

Jean-Laurent Philippe recalls: “Today, the best solution for inference is still the FPGA, and it is Intel FPGAs that Microsoft has chosen for its Brainwave project.” In this project, led by Microsoft Research, engineers set out to build an infrastructure capable of running Deep Learning algorithms in real time at data-center scale. They chose FPGA chips for their very low latency, which is especially useful for Deep Learning. “When you look at the cost-effectiveness of technical solutions for a specific task like Deep Learning, the Xeon is the most flexible solution because it can do everything.”

“The FPGA is very good at repetitive tasks where the algorithm changes only very occasionally. If you settle on an algorithm that will never change, then it is etched into an ASIC, and that is a bit what the NNP accelerator is,” concludes Jean-Laurent Philippe.

Neural processors in smartphones

While AI chips are making their way into large data centers, they are also starting to appear much closer to users, right in the hands of all of us: NPUs are arriving in mobile devices. China’s Huawei has even made them a key marketing argument for its new generation of smartphones. The Kirin 970 SoC, which equips its high-end Mate 10 as well as the Honor View 10 (Honor is a Huawei brand), a mid-range smartphone, integrates an NPU.

The NPU aims to improve the user experience by accelerating machine learning algorithms, image recognition in particular. Any smartphone can do image recognition on its CPU, but the NPU makes this function much faster. This allows the camera application to recognize the nature of the scene and switch to the most appropriate of the 13 supported shooting modes.
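For illustration only, here is a minimal on-device inference sketch assuming TensorFlow Lite and a hypothetical quantized scene-classification model (neither is described in the article); on a phone, the interpreter can offload such a network to the NPU through a delegate instead of running it on the CPU:

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load a quantized image-classification model (hypothetical file name).
interpreter = tflite.Interpreter(model_path="scene_classifier.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Stand-in for a camera preview frame, shaped to the model's input.
frame = np.random.randint(0, 256, size=inp["shape"], dtype=np.uint8)
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()

scores = interpreter.get_tensor(out["index"])[0]
print("predicted scene class:", int(np.argmax(scores)))
```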

While various built-in smartphone features exploit this AI engine, mobile app developers can also access it through an API. Microsoft already uses it for its Microsoft Translator application, which calls on the NPU through these APIs to recognize the text in the images filmed by the user and then launch a translation of that text.

Apple has also shipped a neural chip in its iPhone X, dubbed the “A11 Bionic neural engine” by Apple marketing. This two-core unit is capable of performing 600 billion operations per second, according to Apple’s figures. It powers the Face ID facial recognition system as well as the Animoji.

Multiple startups seek to catch up with the big players in the market

In addition to these giants of electronics, a multitude of startups are trying to find a place in this market. This is the case in the United States with Cerebras Systems, Wave Computing and Groq (a startup founded by former members of the Google TPU team), and in Asia and Europe with Graphcore and the French company Kalray. The second generation of the latter’s Massively Parallel Processor Array (MPPA) processor has 288 cores.

Benoît Dupont de Dinechin, Technical Director of the French startup, emphasizes: “Our MPPA processors are particularly effective for neural networks in ‘inference’ mode, where the key performance indicator is low-latency response. Other machine learning techniques based on dense linear algebra calculations, such as ‘Support Vector Machines’ (SVM), are also well suited to MPPA processors.”
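To make the dense-linear-algebra point concrete, here is a minimal sketch, assuming scikit-learn and toy random data (neither comes from the article), showing that classifying with a trained linear SVM boils down to a dense matrix-vector product, exactly the kind of operation such accelerators target:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Train a linear SVM on toy data (training would happen offline, on a server).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
clf = LinearSVC(dual=False).fit(X, y)

# Inference is just a dense product plus a bias: scores = X_new @ w + b.
X_new = rng.standard_normal((8, 64))
scores = X_new @ clf.coef_.ravel() + clf.intercept_[0]
print((scores > 0).astype(int))   # matches clf.predict(X_new)
```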

Unlike GPUs, the processor cores designed by the French company are programmable in C/C++. They can therefore be used for other types of programs, including the embedded software of a missile (Safran and MBDA are among Kalray’s investors) or driver-assistance software for an autonomous car.

Machine learning (including deep neural networks) is not the whole of the artificial intelligence found in a robot or an autonomous vehicle; there is also the preprocessing, representation, segmentation and fusion of sensor data, situation analysis, and trajectory planning.

The race for innovation in neural processors is just beginning.