How Inferencing Differs From Training in Machine Learning Applications – SemiEngineering

Why it’s so important to match the AI task to the right type of chip.
Machine learning (ML)-based approaches to system development employ a fundamentally different style of programming than historically used in computer science. This approach uses example data to train a model to enable the machine to learn how to perform a task. ML training is highly iterative, with each new piece of training data generating trillions of operations. The iterative nature of the training process, combined with the very large training data sets needed to achieve high accuracy, drives a requirement for extremely high-performance floating-point processing. Most new models describe their training requirements in terms of the number of GPU accelerator cards used and the weeks of processing required. Training equipment for typical vision models ranges in price from hundreds of thousands to millions of dollars and requires power measured in kilowatts to operate. These are often rack-scale systems. ML training is best implemented as data center infrastructure that can be amortized across many different customers to justify the high capital and operating expense.
Inferencing, on the other hand, is the process of using a trained model to produce a probable match for a new piece of data relative to all the data that the model was trained on. Inferencing, in most applications, looks for quick answers that can be arrived at in milliseconds. Examples of inferencing include speech recognition, real-time language translation, machine vision, and ad insertion optimization decisions. When compared to training, inferencing requires a small fraction of the processing power. However, this is still well beyond the processing provided by traditional CPU-based systems. Thus, even with inferencing, acceleration (either on the SoC as IP or as an in-system accelerator) is required to achieve reasonable execution speeds.

Some real examples will help to illustrate the scale of computation we are talking about here. In the table above, we see that the computation required to compile the Linux kernel is roughly 5.4 TeraOps. This computation takes about a minute on a new and well-provisioned PC using the Intel i5-12600K CPU. Quite fast! However, taking a minute or even a few seconds to process an image in a vision system is not very useful. Industrial vision systems are looking for sub-second processing speeds. In this example we use 40 milliseconds as the target speed for an inference, the equivalent of 25 frames per second. This leads to a TeraOps/second requirement significantly above what the i5 can deliver. In fact, the Flex Logix X1 accelerator will outperform the i5 CPU on this workload by about 500X for this metric. It should be reiterated that the i5 CPU and the X1 accelerator are solving very different problems. The X1 accelerator could not be used to compile Linux at all. The general-purpose i5 can process inference workloads, but it will lag the X1 significantly in both performance and efficiency.
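The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the 60-second compile time is an assumed reading of "about a minute," and the per-inference operation count is a hypothetical placeholder, not a figure from the table.

```python
TERA = 1e12  # operations in one TeraOp


def frames_per_second(latency_s: float) -> float:
    """Frame rate sustained when each inference takes latency_s seconds."""
    return 1.0 / latency_s


def sustained_rate(total_ops: float, seconds: float) -> float:
    """Average operations per second for a job of total_ops taking `seconds`."""
    return total_ops / seconds


def required_rate(ops_per_inference: float, latency_s: float) -> float:
    """Operations per second needed to finish one inference within latency_s."""
    return ops_per_inference / latency_s


# 40 ms per inference, as in the article.
fps = frames_per_second(0.040)

# Linux kernel compile: 5.4 TeraOps in "about a minute" (assumed to be 60 s).
cpu_rate = sustained_rate(5.4 * TERA, 60.0)

# Hypothetical placeholder: substitute the real per-inference cost of your model.
HYPOTHETICAL_OPS_PER_INFERENCE = 1.0 * TERA
needed = required_rate(HYPOTHETICAL_OPS_PER_INFERENCE, 0.040)

print(f"Frame rate at 40 ms: {fps:.0f} fps")
print(f"CPU sustained rate on the compile: {cpu_rate / TERA:.2f} TeraOps/s")
print(f"Rate needed for the placeholder model: {needed / TERA:.0f} TeraOps/s")
```

Under these assumptions the CPU sustains roughly 0.09 TeraOps/second on the compile job, which gives a sense of why a tight per-frame latency budget pushes inference toward dedicated acceleration.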
Model inferencing is better performed at the edge where it is closer to the people who are seeking to benefit from the results of the inference decisions. A perfect example is autonomous vehicles where the inference processing cannot be dependent on links to some data center that would be prone to high latency and intermittent connectivity.
The table above also presents an example of model training. This is based on a reported training time of 10 days to build the YOLOv3 model using the COCO dataset on four GTX 1080 Ti GPUs. If we assume each GPU delivers 11.3 TeraFlops and all four run for the full 10 days, the workload requires something like 40 ExaOps to complete. This is 40 million times more computation than is required for a single inference.
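As a quick check on that estimate, the sketch below multiplies out the stated figures. It assumes every GPU runs at its full 11.3 TeraFlops for the entire 10 days, so the result is an upper bound rather than a measured operation count; the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
EXA = 1e18  # operations in one ExaOp


def training_ops(num_gpus: int, flops_per_gpu: float, days: float) -> float:
    """Total operations if every GPU ran at flops_per_gpu for the whole period."""
    return num_gpus * flops_per_gpu * days * SECONDS_PER_DAY


# Figures from the article: 4 GPUs, 11.3 TeraFlops each, 10 days.
total = training_ops(num_gpus=4, flops_per_gpu=11.3e12, days=10)

# The article's ratio of 40 million implies this per-inference cost.
implied_ops_per_inference = total / 40e6

print(f"Training total: {total / EXA:.1f} ExaOps")
print(f"Implied cost of one inference: {implied_ops_per_inference / 1e12:.2f} TeraOps")
```

The product comes to roughly 39 ExaOps, which matches the "something like 40 ExaOps" figure, and the 40-million ratio implies a single inference on the order of one TeraOp.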
Given the extremely high processing requirements, not to mention data storage and communications requirements for training, data center infrastructure is clearly the best place to do training processing.
With this clear dichotomy between training and inferencing, we must consider whether it makes sense to use the same technology for both applications.
Diverging Solutions for Diverging Problems
In a recent white paper [4], Alexander Harrowell at Omdia noted: “AI model training is highly compute-intensive and also requires full general-purpose programmability. As a result, it’s the domain of the high-performance graphics processing unit (GPU), teamed with high-bandwidth memory and powerful CPUs. This suggests that deploying the same hardware used in training for the inference workloads is likely to mean over-provisioning the inference machines with both accelerator and CPU hardware.”
In the same report, Mr. Harrowell estimates that by 2026 edge and data center solutions will take clearly different technological approaches, with GPU-based solutions continuing to lead the way in data centers and custom AI ASICs dominating at the edge.

As ML approaches to solution development proliferate, it will be very important for developers to recognize that the tools and technologies that have grown to create ML technology over the last decade, primarily GPU-based solutions, will not be the best solutions for the deployment of ML inferencing technology in volume. The same pressures that have driven the extreme specialization of semiconductor technology for the last several decades will push for efficient high-performance inference processing for edge applications.
There are a number of clear technological reasons why edge inference accelerators will be very different from GPU solutions designed for training and model development. The following table compares the NVIDIA Jetson AGX, which is often used for edge inference applications, with the Flex Logix X1 inference accelerator. It’s apparent that the X1 is able to offer much lower power, lower cost and higher efficiency than the GPU-based AGX solution, while still offering a compelling level of performance for inferencing applications.

Software and Tools Requirement Differences
A final important point of differentiation between ML training and inferencing is related to the software environments. In model development, training and testing, there are many approaches being used today. These include popular libraries such as CUDA for NVIDIA GPUs, ML frameworks such as TensorFlow and PyTorch, optimized cross-platform model libraries such as Keras, and many more. These tool sets are critical to the development and training of ML models, but inferencing applications require a much smaller and different set of software tools. Inferencing tool sets are focused on running the model on a target platform. Inference tools support the porting of a trained model to the platform. This may include some operator conversions, quantization and host integration services, but it is a considerably simpler set of functions than is required for model development and training.
Inference tools benefit from starting with a standard representation of the model. The Open Neural Network Exchange (ONNX) is a standard format for representing ML models. As its name implies, it is an open standard and is managed as a Linux Foundation project. Technology such as ONNX allows for a decoupling of training and inferencing systems and provides the freedom for developers to choose the best platforms for training and inferencing.
As machine learning approaches become more broadly adopted in edge and embedded systems, there will be increasing use of edge AI ASIC technology that offers much more efficient and cost-effective performance than GPU-based solutions. The emergence of technologies such as ONNX makes it much easier to adopt inference-specific solutions, as they provide a path to cleanly separate the ML training and testing tasks from the ML inferencing task.
Companies looking to deploy AI technology owe it to themselves to evaluate the best solutions for inferencing and not assume that all AI solutions are best implemented on GPU devices.
[1] (Intel Core i5-12600K)
