=='''4.3 ML Model Optimization at the Edge'''==

==='''The Need for Model Optimization at the Edge'''===
Given the constrained resources and the inherently dynamic environments in which edge devices must operate, model optimization is a crucial part of machine learning in edge computing. The most widely used methodology today consists of simply specifying an exceptionally large set of parameters and letting the model train on it. This can be feasible when the hardware is very advanced and powerful, and it is necessary for systems such as Large Language Models (LLMs). However, it is no longer viable when dealing with the devices and environments at the edge. It is crucial to identify the best parameters and training methodology so as to minimize the amount of work done by these devices while compromising as little as possible on the accuracy of the models. There are multiple ways to do this, including optimization or augmentation of the dataset itself and optimization of how the work is partitioned among the edge devices.

==='''Edge and Cloud Collaboration'''===
One methodology that is often used involves collaboration between edge and cloud devices. The cloud has the ability to process workloads that require far more resources than edge devices can provide. Edge devices, on the other hand, can store and process data locally, offering lower latency and better privacy. Given the advantages of each, many have proposed that the best way to handle machine learning is through a combination of edge and cloud computing.

The primary issue facing this computing paradigm, however, is optimally selecting which workloads should run on the cloud and which should run on the edge. This is a crucial problem to solve, as a correct partition of workloads is the best way to ensure that the respective benefits of both tiers are leveraged. A common approach is to run representative computing tasks on the candidate devices and measure the time and resources they take. Examples of this are the profiling steps in [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10818760&tag=1 EdgeShard] [1] and [https://dl.acm.org/doi/pdf/10.1145/3093337.3037698 Neurosurgeon] [4]. Other frameworks implement similar steps, testing the capabilities of different devices in order to allocate workloads and determine the limit up to which each device can operate efficiently. If a workload is beyond the limits of the available devices, it can be sent to the cloud for processing.

The key advantage of this approach is that it uses the resources of the edge devices wherever possible, allowing increased data privacy and lower latency. Since workloads are only processed in the cloud as needed, the overall processing time is reduced because data is not constantly sent back and forth. It also produces much less network congestion, which is crucial for many applications.

[[File:ECcollab.png|400px|thumb|center|The collaboration of Edge and Cloud Devices]]
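The sketch below illustrates the kind of decision such a profiling step enables. It is not the actual EdgeShard or Neurosurgeon code; the layer names, latency figures, and bandwidth value are made-up assumptions, and the split point is chosen by simply comparing the estimated end-to-end latency of every possible partition.

<syntaxhighlight lang="python">
# Minimal sketch of profiling-driven edge/cloud partitioning (illustrative only;
# not the actual EdgeShard or Neurosurgeon implementation). Numbers are made up.

# Hypothetical per-layer profiles gathered during an offline profiling step:
# execution time (ms) on the edge device and on the cloud server, plus the
# size (KB) of the activations each layer outputs.
layers = [
    {"name": "conv1", "edge_ms": 12.0, "cloud_ms": 1.5, "out_kb": 800},
    {"name": "conv2", "edge_ms": 18.0, "cloud_ms": 2.0, "out_kb": 400},
    {"name": "fc1",   "edge_ms": 6.0,  "cloud_ms": 0.8, "out_kb": 16},
    {"name": "fc2",   "edge_ms": 1.0,  "cloud_ms": 0.2, "out_kb": 4},
]

UPLINK_KB_PER_MS = 1.25   # assumed uplink bandwidth (~10 Mbit/s)
INPUT_KB = 600            # size of the raw input if everything is offloaded


def best_split(layers):
    """Try every split point: layers[:k] run on the edge, layers[k:] on the
    cloud. Return the split with the lowest estimated end-to-end latency."""
    best = None
    for k in range(len(layers) + 1):
        edge_time = sum(l["edge_ms"] for l in layers[:k])
        cloud_time = sum(l["cloud_ms"] for l in layers[k:])
        # Data sent over the network: the raw input if everything is offloaded,
        # otherwise the activations produced by the last edge-side layer.
        if k == 0:
            sent_kb = INPUT_KB
        elif k == len(layers):
            sent_kb = 0
        else:
            sent_kb = layers[k - 1]["out_kb"]
        total = edge_time + sent_kb / UPLINK_KB_PER_MS + cloud_time
        if best is None or total < best[1]:
            best = (k, total)
    return best


split, latency = best_split(layers)
print(f"Run the first {split} layer(s) on the edge; estimated latency {latency:.1f} ms")
</syntaxhighlight>

In practice, such frameworks measure these profiles on the real hardware and may also weigh energy and memory, but the underlying decision of comparing per-split compute and transfer costs is the same.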
==='''Optimizing Workload Partitioning'''===
The key idea behind much of the optimization done in machine learning on edge systems is to fully utilize the heterogeneous devices that these systems often contain. As such, it is important to understand the capabilities of each device so as to fully exploit its advantages. Devices can vary greatly, from smartphones with relatively powerful computational abilities, to Raspberry Pis, to simple sensors. More difficult tasks are offloaded to the powerful devices, while simpler tasks, or models that have already been partially pretrained, can be sent to the smaller devices. In some cases, as in [https://ieeexplore.ieee.org/abstract/document/8690980 Mobile-Edge] [2], a task may be dropped altogether if the available resources are deemed insufficient. In this way, exceptionally difficult tasks do not block tasks that can be executed, and the system can continue working.

==='''Dynamic Models'''===
Given the dynamic nature of the environments in which edge devices must function, as well as the heterogeneity of the devices themselves, a dynamic model of workload distribution is often employed. Such models must keep track of the currently available resources, including compute utilization and power, as well as network traffic. These may change frequently depending on the workloads and devices in the system. As such, training models to continuously monitor conditions and dynamically distribute the workloads is a very important part of optimization: simply offloading larger tasks to more powerful devices is ineffective if a device already has all of its computing resources or network capacity consumed by another workload.

This is commonly done by using the profiling step described above as a baseline. A machine learning model then uses the profiling data to estimate the performance of devices and/or layers. At runtime, a similar process is employed, updating the data and helping the model refine its predictions. Network traffic is also taken into account at this stage in order to preserve the low latency that edge computing promises. Using this data and the runtime updates, the partitioning model can dynamically redistribute workloads so that each device uses its resources in the most efficient manner. Two good examples of how such a system is deployed in practice are the Neurosurgeon and EdgeShard systems cited above.
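As a rough illustration of this kind of heterogeneity- and load-aware placement, the sketch below substitutes a simple rule-based latency estimate for the learned performance predictors described above; the device list, capacity figures, and scoring rule are assumptions made purely for the example.

<syntaxhighlight lang="python">
# Illustrative sketch of heterogeneity-aware, load-sensitive task placement.
# Device specs and the scoring formula are assumptions for demonstration,
# not the method of any specific published scheduler.

devices = {
    "smartphone":   {"gflops": 50.0, "load": 0.2},   # fraction of compute already busy
    "raspberry_pi": {"gflops": 5.0,  "load": 0.1},
    "sensor_hub":   {"gflops": 0.5,  "load": 0.0},
}

def estimated_latency(task_gflop, device):
    """Estimate task latency (s) from the device's currently idle compute."""
    free = device["gflops"] * (1.0 - device["load"])
    return float("inf") if free <= 0 else task_gflop / free

def place(task_gflop, max_latency_s):
    """Pick the device with the lowest estimated latency; fall back to the
    cloud (or drop the task, as in Mobile-Edge) if no device meets the deadline."""
    name, dev = min(devices.items(), key=lambda kv: estimated_latency(task_gflop, kv[1]))
    latency = estimated_latency(task_gflop, dev)
    if latency > max_latency_s:
        return "cloud", None
    # Crude runtime-state update so later placements see the device as busier.
    dev["load"] = min(1.0, dev["load"] + task_gflop / dev["gflops"])
    return name, latency

print(place(100.0, max_latency_s=1.0))   # heavy task: no local device meets the deadline -> cloud
print(place(0.5, max_latency_s=1.0))     # light task: placed on the fastest idle local device
</syntaxhighlight>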
==='''Horizontal and Vertical Partitioning'''===
There are two major ways these models split workloads in order to optimize machine learning: horizontal and vertical partitioning [3]. Given a set of layers ranging from the cloud down to the edge, vertical partitioning splits the workload between the layers. For example, a task deemed to require a large amount of computational resources may go to the cloud to be completed and preprocessed; work requiring only a small amount of computational power can go to edge devices. Such partitioning can also depend on the confidence and accuracy of the learning so far: if inference completed on an edge device yields very low accuracy, the work can be sent to the cloud, whereas if the accuracy is already fairly high and the model needs only a small amount of additional work to reach an acceptable threshold, the work can be sent to edge devices to relieve network traffic to the cloud and reduce latency [3].

The second form of partitioning is horizontal partitioning. This splits work among the devices within a single layer rather than between the layers themselves. It is similar to what has been described in previous sections, as it provides a means of fully utilizing the heterogeneous abilities found in edge devices. The same kind of capability assessment is carried out, but all of the devices the workload is split across operate on the same layer [3]. To fully optimize a machine learning deployment, both horizontal and vertical partitioning must be used.

[[File:Edge_computing_layers.png|400px|thumb|center|An example of different layers with multiple devices]]

==='''Distributed Learning'''===
Distributed learning is a process in which, rather than one machine processing all of the data, each node takes in and processes only a subset of the overall data and then sends its results to a central server that holds the main model. Each node contributes these periodic updates, and because it only processes part of the data, the workload is much easier for edge devices to handle. It also reduces the time and computational burden on the cloud and the network, because the central server is no longer performing all of the computation. One popular method of accomplishing this is parallel gradient descent. Gradient descent by itself is a very useful means of training a model; parallel gradient descent uses the same process, but instead of a single update, it aggregates the gradients calculated by each node in order to update the central model. This uses each node efficiently and ensures that the data used to update the model does not exceed the memory constraints of the edge devices acting as nodes.

[[File:Distributed.png|600px|thumb|center| A visualization of data partitioning among nodes doing distributed learning]]
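A minimal sketch of parallel gradient descent is shown below, using a toy linear model in NumPy under the assumption of three nodes holding synthetic data shards: each node computes a gradient on its own shard, and the server averages the gradients to update the shared weights. All of the communication and fault-tolerance machinery of a real system is omitted.

<syntaxhighlight lang="python">
# Toy sketch of synchronous parallel gradient descent: the "nodes" each hold a
# data shard and compute local gradients; the "server" averages them and
# updates the central model. Data and model are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Fake data split into three shards, one per edge node.
true_w = np.array([2.0, -1.0])
shards = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    shards.append((X, y))

def local_gradient(w, X, y):
    """Gradient of the mean-squared error on one node's local shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

w = np.zeros(2)        # central model held by the server
lr = 0.1
for step in range(100):
    grads = [local_gradient(w, X, y) for X, y in shards]   # computed on the nodes
    w -= lr * np.mean(grads, axis=0)                        # aggregated on the server

print("learned weights:", w)   # should be close to [2.0, -1.0]
</syntaxhighlight>

Asynchronous variants, described next, replace the blocking average over all nodes with updates applied as each node's gradient arrives.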
'''Asynchronous Scheduling:''' One important aspect of distributed learning is how the updates are delivered to the model. Each node may have a different amount of data, memory, and computational power, so the time it takes to process and update will not be the same. Synchronous algorithms wait until every node finishes its respective processing and then send the calculations from all nodes to update the central model. Although this can make updating simpler, any node that lags behind the others greatly reduces the speed at which the model is trained; it also leaves nodes idle while others are still processing, which can be highly inefficient. Asynchronous scheduling addresses this: as soon as a node finishes its processing, it sends its results to the main server for an update rather than waiting for all the others to finish. This does require more communication, since after each update every node must fetch the updated model from the central server; doing this repeatedly can be challenging, but the efficiency gained is often worth it.

'''Federated Learning:''' Federated learning is another means by which the data, as well as the computational burden, is distributed among a set of nodes, reducing network traffic and strain on the central servers. It is similar to distributed learning, but the key difference lies in the data partitioning: the data used in federated learning is never shared with the central servers. Instead, each node trains on its own local data, and the only thing shared with the central model is the updated parameters. A key consequence is that federated learning provides much stronger data privacy, which is crucial for applications dealing with sensitive data, making it especially well suited to edge devices. Federated learning is discussed in detail in chapter 5 of this wiki.

==='''Transfer Learning'''===
Transfer learning is a method of machine learning in which a pretrained model is sent to the different nodes for further processing and fine-tuning. The initial training of the model may involve a very large amount of data and place a major burden on the device that executes it, which may not be feasible for edge devices. Therefore, the bulk of the model training is done by a more powerful machine, such as the cloud, and the pretrained model is then sent to the edge device. This is useful because it reduces the computational burden on the edge device and allows it to fine-tune the model using the data it collects, without having to train the whole system from scratch. One related technique is knowledge distillation, in which a smaller model is trained to mimic a larger one; this is often used when edge and cloud systems are combined.

==='''Methods of Data Optimization'''===
'''Data Selection:''' Certain types of data may be more useful than others, so the data that most affects the accuracy of a model can be prioritized for edge devices. This decreases their workload, because less data must be processed, while preserving the accuracy of the model as much as possible. Some larger models or datasets may not be needed at all: for example, an edge device that only needs to handle simple commands and prompts can run a Small Language Model rather than an entire LLM, which might overload its computational abilities without providing enough additional benefit.

'''Data Compression:''' Data can be compressed into a smaller form in order to fit the constraints of edge devices. This is especially important given their limited memory, and it also makes the workload smaller. Quantization, discussed previously, is a prevalent example; a small illustration follows this list.

'''Container Workloads:''' Containerized workloads can be very useful, as they package all the resources and data a device needs to process the work. By examining the computational abilities of a device, workloads of different sizes can be allocated as needed to maximize the efficiency of training.

'''Batch Size Tuning:''' The batch size used by a model is very important when considering the memory constraints of edge devices. Smaller batch sizes allow for a quicker and more efficient means of training and are less likely to lead to memory bottlenecks. This relates back to partitioning, because the compute and memory capacity of the available devices are key factors to consider.
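As a concrete illustration of compression by quantization, the toy NumPy example below stores float32 weights as int8 values plus a single scale factor, reducing memory use roughly fourfold; the tensor shape and the symmetric per-tensor scheme are arbitrary choices for the example rather than any specific toolkit's method.

<syntaxhighlight lang="python">
# Toy example of post-training weight quantization: float32 weights are stored
# as int8 plus one per-tensor scale. Generic symmetric scheme, for illustration.
import numpy as np

weights = np.random.default_rng(1).normal(scale=0.2, size=(256, 256)).astype(np.float32)

scale = np.abs(weights).max() / 127.0                      # map the largest weight to +/-127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantized = q.astype(np.float32) * scale                 # what the device uses at inference
error = np.abs(weights - dequantized).max()

print(f"memory: {weights.nbytes} B -> {q.nbytes} B, max abs error {error:.5f}")
</syntaxhighlight>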
==='''Utilizing Small Language Models (SLMs)'''===
Large Language Models (LLMs) have recently become prevalent and can help with a wide variety of tasks. However, running and training an LLM requires a significant amount of computational resources, which is not feasible on edge devices. Most modern LLMs are cloud based, but this can lead to high latency and increased network traffic, especially when working with a large subsystem of nodes. One way to achieve a similar capability on edge devices is to use Small Language Models (SLMs). These are not as accurate and do not have the vast knowledge of LLMs or the volume of training data behind them, but for basic applications on edge devices they can be sufficient to accomplish many tasks. They are also often fine-tuned for the specific tasks they are deployed for, and they are much faster and more resource-efficient than LLMs. They can also provide much more privacy, because they can run on local devices without sharing user data with the cloud, which is useful for a wide variety of edge applications. If needed, and if privacy constraints permit, an SLM can query an LLM for more complicated tasks. Because not every prompt leads to such a query, the benefits in network traffic, privacy, and latency are still largely preserved.

==='''Synthesis'''===
In summary, edge-oriented machine learning optimization requires an integrated approach that combines model-level compression with system-level orchestration. Techniques such as quantization, structured and unstructured pruning, and knowledge distillation reduce the computational footprint and memory requirements of deep learning models, enabling deployment on resource-constrained devices without substantial loss in inference accuracy. Concurrently, dynamic workload partitioning, heterogeneity-aware scheduling, and adaptive runtime profiling allow the system to allocate tasks across edge and cloud tiers based on real-time availability of compute, bandwidth, and energy resources. This joint optimization across model architecture and execution environment is essential to meet the latency, privacy, and resilience demands of edge AI deployments.