Luiz Zaniolo, Christian Garbin, Oge Marques
Deep learning (DL) has revolutionized the field of artificial intelligence (AI). At its essence, DL consists of building, training, and deploying large, multilayered neural networks. DL techniques have been successfully used in computer vision (CV), natural language processing (NLP), network security, and several other fields. As DL applications become more ubiquitous, another trend is taking place: the growing use of edge devices. The combination of DL/AI solutions deployed on portable (edge) devices is known as edge AI.
Edge devices take many forms, including network routers wired to data center racks, small general-purpose computers (e.g., Arduino, Raspberry Pi, and Nvidia Jetson), many variations of Internet of Things appliances, and the powerful smartphones carried around by most of the global population today. Despite their diversity, edge devices have one thing in common: limited resources (compared with traditional personal computers and servers). They have limited memory, have less powerful processors, and are often battery operated. Because of these constraints, running DL applications on these devices has its own set of challenges. This article discusses the advantages of using DL applications on edge devices and the constraints that make this environment challenging, and it offers suggestions for dealing with these limitations.
The typical DL workflow consists of several stages (Fig. 1): a model is designed and trained on GPU-equipped computers using a large data set, deployed to the target device, used there to make inferences (predictions) on new data, and, when necessary, adapted to the conditions of the deployment environment.
Fig. 1 The typical workflow of DL applications in the context of edge applications. Models are trained on GPU-equipped computers using a large data set. The trained model is used on edge devices to make inferences (predictions).
This article concentrates on the inference and adaptation aspects and shows that developing a DL solution with edge devices in mind may potentially impact every step of the DL workflow.
The growing use of edge devices for DL applications has been motivated primarily by six factors: latency, privacy, reliability, customization, energy efficiency, and cost.
Many edge AI scenarios require very low operation latency, i.e., the applications should make quick decisions based on input data coming from sensors. In those cases, the overhead of (securely) sending the data to a server located in the cloud—where the DL model is used to make inferences on new data—and sending the inference results back to the device may be prohibitively high.
When an edge device is capable of running inference locally, the data required for inference (e.g., recognizing a face or a fingerprint to unlock a smartphone) remain local and subject to the privacy settings of the device and its immediate environment. The alternative, sending private information, such as pictures, intellectual property (code and documents), the user's current location, or voice recordings, to a cloud server for inference, requires additional processing (e.g., end-to-end encryption) to avoid exposing the data to third parties.
Cloud-based inference requires reliable Internet connectivity, which may be subject to hard-to-control factors (the weather, terrain, and so on) and cannot be guaranteed in remote locations. Once again, the ability to run DL applications directly on the edge device sidesteps these problems altogether.
Edge devices can adopt customized hardware and software to meet performance requirements while running DL inference. Examples include field-programmable gate arrays (FPGAs) customized for DL workloads, which deliver faster results with lower power consumption.
Running DL models directly on edge devices also leads to energy savings. The reduced need for network communication eliminates the associated overhead; energy-efficient hardware (e.g., custom-made FPGAs) consumes less power; static random-access memory (RAM) chips require less power to operate because they lack the refresh cycle of their dynamic RAM counterparts; and some mobile devices can lower their CPU clock frequency in less demanding situations to conserve battery power.
The combination of several of these aspects can also lead to cost savings, either by eliminating specific tasks (e.g., encryption, radio transmission, and network routing) or by leveraging sophisticated hardware that users already own (e.g., smartphones or the embedded electronics in vehicles) instead of designing custom hardware.
The following are some of the most popular groups of edge AI applications in use today (Fig. 2).
Fig. 2 Examples of DL on edge devices, with an emphasis on the cases where users are directly involved with the model prediction: (a) CV, (b) AR, and (c) NLP.
CV is the foundation of several applications. CV models can detect and identify objects, people, and even the individual components of an object. For example, they can separate cars from buildings or delineate a person’s ears, nose, and mouth. Numerous applications, ranging from amusing to life-saving ones, can be created with these CV building blocks. The filters in social media applications use models that accurately segment a person’s face to manipulate or replace it. The smartphone version of Google Translate uses CV to recognize and extract the text on signs and menus before passing it to a natural language model. Self-driving cars rely on object detection to identify and obey traffic signs.
Users expect accuracy and a reasonable response time from CV applications. The tolerance for delays varies by the nature of the application. An application that identifies friends in a picture can take a second or two to complete its work. A face detection and replacement filter that is delayed by a few milliseconds on a video stream is usable but becomes irritating for longer delays. A traffic sign detection model that takes more than a few hundred milliseconds to detect a stop sign can be deadly.
CV applications on edge devices [Fig. 2(a)] need to cope with the limitations of the device (e.g., less memory and less processing power than standard computers as well as lower-resolution cameras), changing environments, and the limitations of the user operating the device. For example, face detection and recognition applications must work well on various smartphone models, under good and less-than-ideal lighting conditions, with users who may not have a steady hand.
Augmented reality (AR) applications [Fig. 2(b)] add visual, audio, and haptic feedback to (typically) live video streams. Visual feedback includes drawing lines on roads and sidewalks to indicate where to go, adding “name tags” to people to help them remember their names, and replacing the text on signs with the translated text. Audio and haptic feedback are added to help call attention to actions that need to be taken. For example, cars play an audible alert from specific speakers to indicate on which side of the car an obstacle has been detected. Smartphones vibrate at higher and higher frequencies as the user approaches a turn when following directions.
In addition to the traditional CV tasks (object identification and classification), AR applications have to track objects with anchors. For example, an app overlaying a path on a sidewalk has to identify the sidewalk and keep track of it as the user moves the smartphone around. All of the limitations discussed for CV applications apply to AR, on an even larger scale: creating anchors and tracking them consumes more processing power than simply identifying objects.
Because AR applications work with live video streams, they have strict responsiveness requirements. Video streams typically update the screen 30 times/s or more. At 30 frames/s, an AR application has roughly 33 ms to process and render each frame; at 60 frames/s, the budget shrinks to about 17 ms. Meeting these deadlines is challenging even for regular computers.
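The arithmetic behind that budget is simple; the short Python snippet below is only an illustration of how the per-frame budget follows from the frame rate.

```python
# The per-frame time budget at common frame rates.
for fps in (30, 60, 120):
    budget_ms = 1000 / fps  # milliseconds available to process and render a frame
    print(f"{fps} frames/s -> {budget_ms:.1f} ms per frame")
# 30 frames/s -> 33.3 ms, 60 frames/s -> 16.7 ms, 120 frames/s -> 8.3 ms
```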
NLP models interpret written and spoken human language. NLP models were recently extended to other areas, such as programming languages (but still kept the “natural” part of the abbreviation). The models can operate in a localized context, such as predicting the next word when typing a text message, or in a larger context, such as summarizing an article or answering questions about a large piece of text.
Users expect NLP to work efficiently and accurately in their language. When typing a piece of text, they expect timely suggestions for the language they are using at that moment (which may vary from one piece of text to another). Inaccurate suggestions, or suggestions presented after the user has already moved on in the text, render the application useless. The traditional methods for mitigating these issues, for example, running the model on a separate thread, may not work well on low-powered devices. Even when they work, naively running the complete model frequently (for example, on each keystroke) may drain the battery.
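One common mitigation is to debounce the model: run inference only after the user pauses typing rather than on every keystroke. The sketch below illustrates the idea; predict_next_word and on_keystroke are hypothetical names, not part of any specific framework.

```python
# A minimal debouncing sketch (predict_next_word and on_keystroke are
# hypothetical names): run the NLP model only after the user pauses typing,
# instead of on every keystroke.
import threading

DEBOUNCE_SECONDS = 0.3
_pending_timer = None

def predict_next_word(text: str) -> None:
    print(f"Running inference on: {text!r}")  # placeholder for the expensive model call

def on_keystroke(current_text: str) -> None:
    """Called by the UI on every keystroke; defers inference until typing pauses."""
    global _pending_timer
    if _pending_timer is not None:
        _pending_timer.cancel()  # the user is still typing; postpone inference
    _pending_timer = threading.Timer(DEBOUNCE_SECONDS, predict_next_word,
                                     args=(current_text,))
    _pending_timer.start()
```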
The limited amount of memory also constrains NLP applications on edge devices [Fig. 2(c)]. Most of the recent successes in NLP rely on large models. (GPT-3 is the latest example of a successful but enormous NLP model.) Some applications (for example, predicting the next word in a sentence) may work with smaller NLP models. Large applications, such as voice assistants, may resort to sending data to the cloud because their models do not fit on edge devices. Getting data out of a device may violate the user’s privacy expectations. Balancing functionality with privacy is a fundamental concern of sophisticated NLP applications today.
In engineering, just as in life, there is no free lunch. The advantages of running inference on edge devices listed earlier come at a cost. This section highlights some of the constraints and limitations imposed by edge devices that might offset some of the advantages mentioned earlier.
The limited processing power of edge devices can translate into latency, i.e., noticeable performance delays, impacting the user experience.
Google’s RAIL model categorizes users’ perception of delays; depending on the application, different delay values can be perceived as a break in the flow. RAIL defines four categories: response (react to user input within roughly 100 ms), animation (produce each frame within roughly 10 ms), idle (perform deferred work in chunks of about 50 ms), and load (deliver interactive content within about 5 s).
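Checking whether an on-device model stays within these budgets starts with measuring inference latency directly. The sketch below is a minimal example; run_inference is a hypothetical placeholder for the real model call.

```python
# A minimal latency check (run_inference is a hypothetical placeholder):
# measure how long one inference takes and compare it with RAIL's ~100 ms
# budget for responding to a user action.
import time

RESPONSE_BUDGET_MS = 100

def run_inference(sample):
    time.sleep(0.02)  # stand-in for the actual model computation

start = time.perf_counter()
run_inference(sample=None)
elapsed_ms = (time.perf_counter() - start) * 1000
verdict = "within" if elapsed_ms <= RESPONSE_BUDGET_MS else "over"
print(f"Inference took {elapsed_ms:.1f} ms ({verdict} the RAIL response budget)")
```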
DL models “learn” by adjusting their parameters (or weights) during training. The more parameters a model has, the more accurate it is (in general terms). However, the more parameters a model has, the more memory it uses. Some popular DL models, such as the VGG and ResNet model families used in CV tasks, are so large that they may not fit in the limited amount of memory available in edge devices. Even when they fit, their memory requirements may cause undesired side effects for the user. For example, the operating system may be forced to evict other applications from the memory to make room for the model.
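A back-of-the-envelope calculation shows why such models strain edge devices. The parameter counts below are the commonly cited figures for these architectures, and 32-bit floating-point weights are assumed.

```python
# Rough memory footprint of the weights alone, assuming 32-bit (4-byte)
# floating-point parameters.
models = {
    "VGG16": 138_000_000,      # ~138 million parameters
    "ResNet50": 25_600_000,    # ~25.6 million parameters
    "MobileNetV2": 3_500_000,  # ~3.5 million parameters
}
for name, params in models.items():
    print(f"{name}: ~{params * 4 / 1e6:.0f} MB of weights at float32")
# VGG16 alone needs roughly 550 MB before counting activations and the runtime.
```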
High CPU usage forces edge devices to increase the processor clock frequency, which in turn consumes more energy. Even for devices that possess the needed processing power, it is preferable not to run at the maximum level for extended periods because sustained peak usage drains the battery. Smartphone users dislike applications that consume extra power because they need the battery life for other tasks.
The majority of edge devices rely on wireless connections to communicate to cloud servers. The wireless infrastructure may not deliver high throughput, resulting in restricted bandwidth for devices. Physical factors, such as base station coverage, weather, or natural obstacles, may also reduce the reliability of the connection.
State-of-the-art smartphones have high-resolution cameras, but they are usually not designed for technical imaging that may require special filters or macro capabilities. Smartphones have many other sensors, such as accelerometers, magnetometers, barometers, and GPS, but these are not as precise as professional, dedicated instruments.
Labs usually have the equipment to set the ideal lighting conditions for acquiring an image. This is not true for an edge device in the user’s hand. Low light intensity can reduce brightness and contrast, and the color temperature of the light source can shift colors.
Background information can also interfere with the captured sample: a poorly chosen background behind a photographed subject, or background noise mixed with an audio sample, can make it difficult for neural networks to interpret the input.
Professional equipment is operated by trained professionals. When users are performing sample acquisitions with an edge device, their lack of knowledge of the device’s capabilities can influence the quality of the sample. For example, a user unaware of the minimum focal distance allowed by a smartphone camera can capture an out-of-focus image.
The user’s physical condition can also influence the quality of the sample. Tremors or visual impairments, for example, can make image acquisition more challenging and deliver a lower-quality result.
To address the device, environment, and user limitations and work efficiently on edge devices, a DL model needs to fit within the device’s limited memory, make inferences with low latency, consume as little power as possible, and adapt to the environment in which it is deployed.
Luckily, these requirements reinforce each other. Generally speaking, smaller models need to perform fewer calculations for inference; therefore, they have lower latency and consume less power. The following subsections describe some of the techniques that can be used to achieve these desirable goals. Table 1 shows how these techniques and associated goals are related.
Table 1. The mapping of suggested techniques to areas of improvement.
When choosing a DL model for an application, we should not overlook the option of not using DL at all. Although “machine learning” (ML) and “neural networks” (together with “DL”) are frequently conflated, there are other ML solutions that do not use DL. For some problems, logistic regression, decision trees, random forests, naive Bayes, k-nearest neighbors, and other algorithms may perform just as well as a DL model, or at least well enough for the application.
Before deploying a DL-based application, we should compare its performance to other ML algorithms. The answer may be surprising for some use cases. At a minimum, it will validate the decision to deploy the more complex and resource-intensive DL model.
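As an illustration of such a comparison (a sketch using scikit-learn’s small built-in digits data set, not a benchmark from this article), a simple baseline and a small neural network can be evaluated side by side in a few lines.

```python
# A minimal sketch using scikit-learn's small digits data set: compare a simple
# logistic regression baseline against a small neural network on the same task.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=2000).fit(X_train, y_train)
neural_net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                           random_state=0).fit(X_train, y_train)

print("Logistic regression accuracy:", baseline.score(X_test, y_test))
print("Neural network accuracy:     ", neural_net.score(X_test, y_test))
# If the simpler model is close enough, it may be the better choice for the edge.
```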
Specialized techniques aim to either 1) increase a model’s accuracy without increasing its size or 2) reduce the model’s size without significantly affecting its accuracy. These techniques can be applied during training or after the model has been trained. Examples include pruning, which removes weights that contribute little to the model’s output; quantization, which stores weights with lower-precision numbers, such as 8-bit integers instead of 32-bit floating-point values; and training-time approaches, such as patterned strides.
Specialized techniques can be combined. For example, a model can be trained with patterned stride, pruned, and quantized to reduce the size of the weights that survived pruning.
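As a concrete example of one of these techniques, post-training quantization is available out of the box in TensorFlow Lite. The sketch below uses a tiny stand-in Keras model; in practice, the converter would be applied to the trained application model.

```python
# A minimal sketch of post-training quantization with the TensorFlow Lite
# converter. The tiny Sequential model below is only a stand-in.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
# The quantized file is typically about 4x smaller than the float32 original.
```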
Some network architectures optimized for the constrained environment of edge devices have been proposed in recent years. MobileNet (which introduced “depthwise separable convolutions”) and SqueezeNet (which introduced “squeeze layers”) were explicitly created for CV tasks on devices with limited memory and CPUs. MobileBERT was created for NLP tasks on the same class of devices. Tools such as TensorFlow Lite and PyTorch Mobile help convert, optimize, and deploy networks for edge applications.
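To see why depthwise separable convolutions help, the sketch below (assuming TensorFlow/Keras) compares the parameter count of a standard 3 x 3 convolution with that of its depthwise separable counterpart. It is a toy comparison of the building block, not MobileNet’s full architecture.

```python
# A toy comparison: parameter counts of a standard 3x3 convolution versus its
# depthwise separable counterpart on a 3-channel input.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))

# Standard 3x3 convolution producing 64 output channels.
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)

# Depthwise separable version: a 3x3 depthwise step plus a 1x1 pointwise step.
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding="same")(inputs)
separable = tf.keras.layers.Conv2D(64, 1)(depthwise)

print(tf.keras.Model(inputs, standard).count_params())   # 1,792 parameters
print(tf.keras.Model(inputs, separable).count_params())  # 286 parameters
```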
Desktops and servers typically use GPUs to accelerate DL training and inference. However, the GPUs they use are power hungry and expensive and, therefore, not suitable for edge applications. In the past few years, hardware accelerators designed specifically for edge applications have appeared, for example, the Google-backed Coral boards (built around the Edge TPU) and Nvidia’s Jetson line (built around embedded GPUs).
While fast, GPUs are still general-purpose pieces of hardware. FPGAs go one step further, making the hardware configurable to specific tasks. The customization at the hardware level makes them more power efficient, with lower latency than general-purpose GPUs.
The environment and user limitations may result in having to deal with data that do not match the distribution of the data set used to train the model. For example, the lighting settings of a factory may not match the lighting settings of the pictures used to train a model that identifies defective parts, or the user’s device may have a lower-end camera, affecting all samples that a CV model has to work with.
To cope with these cases, we need to be prepared to fine-tune the model at the edge, within the environment in which the model is being used. Fine-tuning is, conceptually, the same process used to train the model, except that now we use a much smaller data set comprising samples taken in the actual environment. Because the number of samples is small, this process can be executed on the edge device. An example of this process is Apple’s Face ID. Users are asked to provide a sample of their face on their devices to fine-tune the generic model delivered by Apple.
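A minimal fine-tuning sketch is shown below (assuming TensorFlow/Keras). The random arrays stand in for the small set of samples collected in the deployment environment, and the two-class head is only an example.

```python
# A minimal fine-tuning sketch: freeze a pretrained backbone and train only a
# small classification head on samples gathered in the deployment environment.
# The random arrays below are placeholders for those locally collected samples.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                          pooling="avg", input_shape=(96, 96, 3))
base.trainable = False  # keep the generic features learned during full training

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g., "defective" vs. "ok"
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# In practice, these would be a handful of samples captured on the device itself.
local_images = np.random.rand(16, 96, 96, 3).astype("float32")
local_labels = np.random.randint(0, 2, size=16)
model.fit(local_images, local_labels, epochs=3, batch_size=8)
```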
Most devices offer methods to read how much memory is installed, the CPU family they use, the camera resolution, and other characteristics. Applications can reconfigure themselves at runtime using that information to make the best use of a device’s capabilities. For example, a CV application may run a ResNet model (larger and more accurate) on more powerful devices and a MobileNet model (smaller and less accurate) on less powerful devices.
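The sketch below illustrates this kind of runtime adaptation; the model file names are hypothetical, and the psutil package is assumed to be available for querying system memory.

```python
# A minimal sketch of runtime adaptation (hypothetical model file names): choose
# a model variant at startup based on how much memory the device has installed.
import psutil  # assumed to be available for querying system characteristics

def choose_model_path() -> str:
    total_gb = psutil.virtual_memory().total / 1e9
    if total_gb >= 4:
        return "resnet50_quantized.tflite"    # larger, more accurate
    return "mobilenet_v2_quantized.tflite"    # smaller, less accurate, faster

print("Selected model:", choose_model_path())
```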
As we enter the age of edge AI, there will be a growing need to build and adapt DL solutions to a wide range of devices and use cases. This is not only an optimization problem: it is also a design problem that can be solved by addressing multiple stages of the DL workflow.
Luiz Zaniolo (lzaniolo@fau.edu) earned his Ph.D. degree in computer science from Florida Atlantic University in 2021. He is part of the MATLAB Mobile development team at MathWorks, Natick, Massachusetts, 01760, USA. He has previously held software development roles in e-commerce and telecommunications.
Christian Garbin (cgarbin@fau.edu) earned his M.S. degree in computer science from Florida Atlantic University in 2020 and is currently a Ph.D. student at the same institution. He is also a senior architect and a distinguished expert at Atos, Boca Raton, Florida, 33431, USA, a major IT company. He has held software development and management positions in telecommunications and financial industries, working with highly available, high-performance products.
Oge Marques (omarques@fau.edu) earned his Ph.D. degree in computer engineering from Florida Atlantic University, Boca Raton, Florida, 33431, USA, where he has been a professor since 2001. He is the author of 11 books and more than 120 scholarly publications in the area of visual artificial intelligence. He is a Sigma Xi Distinguished Speaker, a Fellow of the Leshner Leadership Institute of the American Association for the Advancement of Science, a Tau Beta Pi Eminent Engineer, and a Senior Member of both IEEE and the Association for Computing Machinery.
Digital Object Identifier 10.1109/MPOT.2022.3182519