Luiz Zaniolo, Christian Garbin, Oge Marques
Deep learning (DL) has revolutionized the field of artificial intelligence (AI). At its essence, DL consists of building, training, and deploying large, multilayered neural networks. DL techniques have been successfully used in computer vision (CV), natural language processing (NLP), network security, and several other fields. As DL applications become more ubiquitous, another trend is taking place: the growing use of edge devices. The combination of DL/AI solutions deployed on portable (edge) devices is known as edge AI.
Edge devices take many forms, including network routers wired to data center racks, small general-purpose computers (e.g., Arduino, Raspberry Pi, and Nvidia Jetson), many variations of Internet of Things appliances, and the powerful smartphones carried around by most of the global population today. Despite their diversity, edge devices have one thing in common: limited resources (compared with traditional personal computers and servers). They have limited memory, have less powerful processors, and are often battery operated. Because of these constraints, running DL applications on these devices has its own set of challenges. This article discusses the advantages of using DL applications on edge devices and the constraints that make this environment challenging, and it offers suggestions for dealing with these limitations.
The typical DL workflow consists of several stages (Fig. 1): a model is designed and trained on GPU-equipped computers using a large data set, deployed to the target device, used there to make inferences (predictions) on new data, and, when necessary, adapted to the conditions of the deployment environment.
Fig. 1 The typical workflow of DL applications in the context of edge applications. Models are trained on GPU-equipped computers using a large data set. The trained model is used on edge devices to make inferences (predictions).
This article concentrates on the inference and adaptation aspects and shows that developing a DL solution with edge devices in mind may potentially impact every step of the DL workflow.
The growing use of edge devices for DL applications has been motivated primarily by six factors: latency, privacy, reliability, customization, energy efficiency, and cost.
Many edge AI scenarios require very low operation latency, i.e., the applications should make quick decisions based on input data coming from sensors. In those cases, the overhead of (securely) sending the data to a server located in the cloud—where the DL model is used to make inferences on new data—and sending the inference results back to the device may be prohibitively high.
When an edge device is capable of running inference locally, the data required for inference (e.g., recognizing a face or a fingerprint to unlock a smartphone) remain local and subject to the privacy settings of the device and its immediate environment. The alternative, sending private information, such as pictures, intellectual property (code and documents), the user's current location, or voice recordings, to a cloud server for inference, requires additional processing (e.g., end-to-end encryption) to avoid exposing the data to third parties.
Cloud-based inference requires reliable Internet connectivity, which may be subject to hard-to-control factors (the weather, terrain, and so on) and cannot be guaranteed in remote locations. Once again, the ability to run DL applications directly on the edge device sidesteps these problems altogether.
Edge devices can adopt customized hardware and software to meet performance requirements while running DL inference. Examples include field-programmable gate arrays (FPGAs) customized for DL workloads, which deliver faster results with lower power consumption.
Running DL models directly on edge devices also leads to energy savings. The reduced need for network communication eliminates the associated overhead; energy-efficient hardware (e.g., custom-made FPGAs) consumes less power; static random-access memory (RAM) chips require less power to operate because they lack the refresh cycle of their dynamic RAM counterparts; and some mobile devices can lower their CPU clock frequency in less demanding situations to conserve battery power.
The combination of several of these aspects can also lead to cost savings, either by eliminating specific tasks (e.g., encryption, radio transmission, and network routing) or by leveraging sophisticated hardware that users already own (e.g., smartphones or the embedded electronics in vehicles) instead of designing custom hardware.
The following are some of the most popular groups of edge AI applications in use today (Fig. 2).
Fig. 2 Examples of DL on edge devices, with an emphasis on the cases where users are directly involved with the model prediction: (a) CV, (b) AR, and (c) NLP.
CV is the foundation of several applications. CV models can detect and identify objects, people, and even the individual components of an object. For example, they can separate cars from buildings or delineate a person’s ears, nose, and mouth. Numerous applications, ranging from amusing to life-saving ones, can be created with these CV building blocks. The filters in social media applications use models that accurately segment a person’s face to manipulate or replace it. The smartphone version of Google Translate uses CV to recognize and extract the text on signs and menus before passing it to a natural language model. Self-driving cars rely on object detection to identify and obey traffic signs.
Users expect accuracy and a reasonable response time from CV applications. The tolerance for delays varies by the nature of the application. An application that identifies friends in a picture can take a second or two to complete its work. A face detection and replacement filter that is delayed by a few milliseconds on a video stream is usable but becomes irritating for longer delays. A traffic sign detection model that takes more than a few hundred milliseconds to detect a stop sign can be deadly.
CV applications on edge devices [Fig. 2(a)] need to cope with the limitations of the device (e.g., less memory and less processing power than standard computers as well as lower-resolution cameras), changing environments, and the limitations of the user operating the device. For example, face detection and recognition applications must work well on various smartphone models, under good and less-than-ideal lighting conditions, with users who may not have a steady hand.
Augmented reality (AR) applications [Fig. 2(b)] add visual, audio, and haptic feedback to (typically) live video streams. Visual feedback includes drawing lines on roads and sidewalks to indicate where to go, adding “name tags” to people to help them remember their names, and replacing the text on signs with the translated text. Audio and haptic feedback are added to help call attention to actions that need to be taken. For example, cars play an audible alert from specific speakers to indicate on which side of the car an obstacle has been detected. Smartphones vibrate at higher and higher frequencies as the user approaches a turn when following directions.
In addition to the traditional CV tasks (object identification and classification), AR applications have to track objects with anchors. For example, an app overlaying a path on a sidewalk has to identify the sidewalk and keep track of it as the user moves the smartphone around. All of the limitations discussed for CV applications apply to AR, on an even larger scale: creating anchors and tracking them consumes more processing power than simply identifying objects.
Because AR applications work with live video streams, they have strict responsiveness requirements. Video streams typically update the screen 30 times/s or more. At 30 frames/s, an AR application has roughly 33 ms to process and render each frame; at 60 frames/s, the budget shrinks to about 17 ms. Meeting these deadlines is challenging even for regular computers.
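The arithmetic behind that budget is simple; the short Python snippet below is only an illustration of how the per-frame budget follows from the frame rate.

```python
# The per-frame time budget at common frame rates.
for fps in (30, 60, 120):
    budget_ms = 1000 / fps  # milliseconds available to process and render a frame
    print(f"{fps} frames/s -> {budget_ms:.1f} ms per frame")
# 30 frames/s -> 33.3 ms, 60 frames/s -> 16.7 ms, 120 frames/s -> 8.3 ms
```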
NLP models interpret written and spoken human language. NLP models were recently extended to other areas, such as programming languages (but still kept the “natural” part of the abbreviation). The models can operate in a localized context, such as predicting the next word when typing a text message, or in a larger context, such as summarizing an article or answering questions about a large piece of text.
Users expect NLP to work efficiently and accurately in their language. When typing a piece of text, they expect timely suggestions for the language they are using at that moment (which may vary from one piece of text to another). Inaccurate suggestions, or suggestions presented after the user has already moved on in the text, render the application useless. The traditional methods for mitigating these issues, for example, running the model on a separate thread, may not work well on low-powered devices. Even when they work, naively running the complete model frequently (for example, on each keystroke) may drain the battery.
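One common mitigation is to debounce the model: run inference only after the user pauses typing rather than on every keystroke. The sketch below illustrates the idea; predict_next_word and on_keystroke are hypothetical names, not part of any specific framework.

```python
# A minimal debouncing sketch (predict_next_word and on_keystroke are
# hypothetical names): run the NLP model only after the user pauses typing,
# instead of on every keystroke.
import threading

DEBOUNCE_SECONDS = 0.3
_pending_timer = None

def predict_next_word(text: str) -> None:
    print(f"Running inference on: {text!r}")  # placeholder for the expensive model call

def on_keystroke(current_text: str) -> None:
    """Called by the UI on every keystroke; defers inference until typing pauses."""
    global _pending_timer
    if _pending_timer is not None:
        _pending_timer.cancel()  # the user is still typing; postpone inference
    _pending_timer = threading.Timer(DEBOUNCE_SECONDS, predict_next_word,
                                     args=(current_text,))
    _pending_timer.start()
```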
The limited amount of memory also constrains NLP applications on edge devices [Fig. 2(c)]. Most of the recent successes in NLP rely on large models. (GPT-3 is the latest example of a successful but enormous NLP model.) Some applications (for example, predicting the next word in a sentence) may work with smaller NLP models. Large applications, such as voice assistants, may resort to sending data to the cloud because their models do not fit on edge devices. Getting data out of a device may violate the user’s privacy expectations. Balancing functionality with privacy is a fundamental concern of sophisticated NLP applications today.
In engineering, just as in life, there is no free lunch. The advantages of running inference on edge devices listed earlier come at a cost. This section highlights some of the constraints and limitations imposed by edge devices that might offset some of the advantages mentioned earlier.
The limited processing power of edge devices can translate into latency, i.e., noticeable performance delays, impacting the user experience.
Google’s RAIL model categorizes users’ perception of delays; depending on the application, different delay values can be perceived as a break in the flow. RAIL defines four categories: response (react to user input within roughly 100 ms), animation (produce each frame within roughly 10 ms), idle (perform deferred work in chunks of about 50 ms), and load (deliver interactive content within about 5 s).
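Checking whether an on-device model stays within these budgets starts with measuring inference latency directly. The sketch below is a minimal example; run_inference is a hypothetical placeholder for the real model call.

```python
# A minimal latency check (run_inference is a hypothetical placeholder):
# measure how long one inference takes and compare it with RAIL's ~100 ms
# budget for responding to a user action.
import time

RESPONSE_BUDGET_MS = 100

def run_inference(sample):
    time.sleep(0.02)  # stand-in for the actual model computation

start = time.perf_counter()
run_inference(sample=None)
elapsed_ms = (time.perf_counter() - start) * 1000
verdict = "within" if elapsed_ms <= RESPONSE_BUDGET_MS else "over"
print(f"Inference took {elapsed_ms:.1f} ms ({verdict} the RAIL response budget)")
```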
DL models “learn” by adjusting their parameters (or weights) during training. The more parameters a model has, the more accurate it is (in general terms). However, the more parameters a model has, the more memory it uses. Some popular DL models, such as the VGG and ResNet model families used in CV tasks, are so large that they may not fit in the limited amount of memory available in edge devices. Even when they fit, their memory requirements may cause undesired side effects for the user. For example, the operating system may be forced to evict other applications from the memory to make room for the model.
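A back-of-the-envelope calculation shows why such models strain edge devices. The parameter counts below are the commonly cited figures for these architectures, and 32-bit floating-point weights are assumed.

```python
# Rough memory footprint of the weights alone, assuming 32-bit (4-byte)
# floating-point parameters.
models = {
    "VGG16": 138_000_000,      # ~138 million parameters
    "ResNet50": 25_600_000,    # ~25.6 million parameters
    "MobileNetV2": 3_500_000,  # ~3.5 million parameters
}
for name, params in models.items():
    print(f"{name}: ~{params * 4 / 1e6:.0f} MB of weights at float32")
# VGG16 alone needs roughly 550 MB before counting activations and the runtime.
```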
High CPU usage forces edge devices to increase the processor clock frequency, which in turn consumes more energy. Even for devices that possess the needed processing power, it is preferable not to run at the maximum level for extended periods because sustained peak usage drains the battery. Smartphone users dislike applications that consume extra power because they need the battery life for other tasks.
The majority of edge devices rely on wireless connections to communicate to cloud servers. The wireless infrastructure may not deliver high throughput, resulting in restricted bandwidth for devices. Physical factors, such as base station coverage, weather, or natural obstacles, may also reduce the reliability of the connection.
State-of-the-art smartphones have high-resolution cameras, but they are usually not designed for technical imaging that may require special filters or macro capabilities. Smartphones have many other sensors, such as accelerometers, magnetometers, barometers, and GPS, but these are not as precise as professional, dedicated instruments.
Labs usually have the equipment to set the ideal lighting conditions for acquiring an image. This is not true for an edge device in the user’s hand. Low light intensity can reduce brightness and contrast, and the color temperature of the light source can shift colors.
Background information can also interfere with the captured sample: a poorly chosen background behind a photographed subject, or background noise mixed with an audio sample, can make it difficult for neural networks to interpret the input.
Professional equipment is operated by trained professionals. When users are performing sample acquisitions with an edge device, their lack of knowledge of the device’s capabilities can influence the quality of the sample. For example, a user unaware of the minimum focal distance allowed by a smartphone camera can capture an out-of-focus image.
The user’s physical condition can also influence the quality of the sample. Tremors or visual impairments, for example, can make image acquisition more challenging and deliver a lower-quality result.
To address the device, environment, and user limitations and work efficiently on edge devices, a DL model needs to fit within the device’s limited memory, make inferences with low latency, consume as little power as possible, and adapt to the environment in which it is deployed.
Luckily, these requirements reinforce each other. Generally speaking, smaller models need to perform fewer calculations for inference; therefore, they have lower latency and consume less power. The following subsections describe some of the techniques that can be used to achieve these desirable goals. Table 1 shows how these techniques and associated goals are related.
Table 1. The mapping of suggested techniques to areas of improvement.
When choosing a DL model for an application, we should not overlook the option of not using DL at all. Although “machine learning” (ML) and “neural networks” (together with “DL”) are frequently conflated, there are other ML solutions that do not use DL. For some problems, logistic regression, decision trees, random forests, naive Bayes, k-nearest neighbors, and other algorithms may perform just as well as a DL model, or at least well enough for the application.
Before deploying a DL-based application, we should compare its performance to other ML algorithms. The answer may be surprising for some use cases. At a minimum, it will validate the decision to deploy the more complex and resource-intensive DL model.
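As an illustration of such a comparison (a sketch using scikit-learn’s small built-in digits data set, not a benchmark from this article), a simple baseline and a small neural network can be evaluated side by side in a few lines.

```python
# A minimal sketch using scikit-learn's small digits data set: compare a simple
# logistic regression baseline against a small neural network on the same task.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=2000).fit(X_train, y_train)
neural_net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                           random_state=0).fit(X_train, y_train)

print("Logistic regression accuracy:", baseline.score(X_test, y_test))
print("Neural network accuracy:     ", neural_net.score(X_test, y_test))
# If the simpler model is close enough, it may be the better choice for the edge.
```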
Specialized techniques aim to either 1) increase a model’s accuracy without increasing its size or 2) reduce the model’s size without significantly affecting its accuracy. These techniques can be applied during training or after the model has been trained. Examples include pruning, which removes weights that contribute little to the model’s output; quantization, which stores weights with lower-precision numbers, such as 8-bit integers instead of 32-bit floating-point values; and training-time approaches, such as patterned strides.
Specialized techniques can be combined. For example, a model can be trained with patterned stride, pruned, and quantized to reduce the size of the weights that survived pruning.
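As a concrete example of one of these techniques, post-training quantization is available out of the box in TensorFlow Lite. The sketch below uses a tiny stand-in Keras model; in practice, the converter would be applied to the trained application model.

```python
# A minimal sketch of post-training quantization with the TensorFlow Lite
# converter. The tiny Sequential model below is only a stand-in.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
# The quantized file is typically about 4x smaller than the float32 original.
```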
Some network architectures optimized for the constrained environment of edge devices have been proposed in recent years. MobileNet (which introduced “depthwise separable convolutions”) and SqueezeNet (which introduced “squeeze layers”) were explicitly created for CV tasks on devices with limited memory and CPUs. MobileBERT was created for NLP tasks on the same class of devices. Tools such as TensorFlow Lite and PyTorch Mobile help convert, optimize, and deploy networks for edge applications.
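To see why depthwise separable convolutions help, the sketch below (assuming TensorFlow/Keras) compares the parameter count of a standard 3 x 3 convolution with that of its depthwise separable counterpart. It is a toy comparison of the building block, not MobileNet’s full architecture.

```python
# A toy comparison: parameter counts of a standard 3x3 convolution versus its
# depthwise separable counterpart on a 3-channel input.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))

# Standard 3x3 convolution producing 64 output channels.
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)

# Depthwise separable version: a 3x3 depthwise step plus a 1x1 pointwise step.
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding="same")(inputs)
separable = tf.keras.layers.Conv2D(64, 1)(depthwise)

print(tf.keras.Model(inputs, standard).count_params())   # 1,792 parameters
print(tf.keras.Model(inputs, separable).count_params())  # 286 parameters
```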
Desktops and servers typically use GPUs to accelerate DL training and inference. However, the GPUs they use are power hungry and expensive and, therefore, not suitable for edge applications. In the past few years, hardware accelerators designed specifically for edge applications have appeared, for example, the Google-backed Coral boards (built around the Edge TPU) and Nvidia’s Jetson line (built around embedded GPUs).
While fast, GPUs are still general-purpose pieces of hardware. FPGAs go one step further, making the hardware configurable to specific tasks. The customization at the hardware level makes them more power efficient, with lower latency than general-purpose GPUs.
The environment and user limitations may result in having to deal with data that do not match the distribution of the data set used to train the model. For example, the lighting settings of a factory may not match the lighting settings of the pictures used to train a model that identifies defective parts, or the user’s device may have a lower-end camera, affecting all samples that a CV model has to work with.
To cope with these cases, we need to be prepared to fine-tune the model at the edge, within the environment in which the model is being used. Fine-tuning is, conceptually, the same process used to train the model, except that now we use a much smaller data set comprising samples taken in the actual environment. Because the number of samples is small, this process can be executed on the edge device. An example of this process is Apple’s Face ID. Users are asked to provide a sample of their face on their devices to fine-tune the generic model delivered by Apple.
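A minimal fine-tuning sketch is shown below (assuming TensorFlow/Keras). The random arrays stand in for the small set of samples collected in the deployment environment, and the two-class head is only an example.

```python
# A minimal fine-tuning sketch: freeze a pretrained backbone and train only a
# small classification head on samples gathered in the deployment environment.
# The random arrays below are placeholders for those locally collected samples.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                          pooling="avg", input_shape=(96, 96, 3))
base.trainable = False  # keep the generic features learned during full training

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g., "defective" vs. "ok"
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# In practice, these would be a handful of samples captured on the device itself.
local_images = np.random.rand(16, 96, 96, 3).astype("float32")
local_labels = np.random.randint(0, 2, size=16)
model.fit(local_images, local_labels, epochs=3, batch_size=8)
```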
Most devices offer methods to read how much memory is installed, the CPU family they use, the camera resolution, and other characteristics. Applications can reconfigure themselves at runtime using that information to make the best use of a device’s capabilities. For example, a CV application may run a ResNet model (larger and more accurate) on more powerful devices and a MobileNet model (smaller and less accurate) on less powerful devices.
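The sketch below illustrates this kind of runtime adaptation; the model file names are hypothetical, and the psutil package is assumed to be available for querying system memory.

```python
# A minimal sketch of runtime adaptation (hypothetical model file names): choose
# a model variant at startup based on how much memory the device has installed.
import psutil  # assumed to be available for querying system characteristics

def choose_model_path() -> str:
    total_gb = psutil.virtual_memory().total / 1e9
    if total_gb >= 4:
        return "resnet50_quantized.tflite"    # larger, more accurate
    return "mobilenet_v2_quantized.tflite"    # smaller, less accurate, faster

print("Selected model:", choose_model_path())
```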
As we enter the age of edge AI, there will be a growing need to build and adapt DL solutions to a wide range of devices and use cases. This is not only an optimization problem: it is also a design problem that can be solved by addressing multiple stages of the DL workflow.
Luiz Zaniolo (lzaniolo@fau.edu) earned his Ph.D. degree in computer science from Florida Atlantic University in 2021. He is part of the MATLAB Mobile development team at MathWorks, Natick, Massachusetts, 01760, USA. He has previously held software development roles in e-commerce and telecommunications.
Christian Garbin (cgarbin@fau.edu) earned his M.S. degree in computer science from Florida Atlantic University in 2020 and is currently a Ph.D. student at the same institution. He is also a senior architect and a distinguished expert at Atos, Boca Raton, Florida, 33431, USA, a major IT company. He has held software development and management positions in telecommunications and financial industries, working with highly available, high-performance products.
Oge Marques (omarques@fau.edu) earned his Ph.D. degree in computer engineering from Florida Atlantic University, Boca Raton, Florida, 33431, USA, where he has been a professor since 2001. He is the author of 11 books and more than 120 scholarly publications in the area of visual artificial intelligence. He is a Sigma Xi Distinguished Speaker, a Fellow of the Leshner Leadership Institute of the American Association for the Advancement of Science, a Tau Beta Pi Eminent Engineer, and a Senior Member of both IEEE and the Association for Computing Machinery.
Digital Object Identifier 10.1109/MPOT.2022.3182519