==='''Model Compression Techniques'''===
Despite the challenges of ML at the edge, a variety of methods can make training more efficient and bring heavy workloads within reach of even the limited computing power of certain edge devices.

'''Quantization:''' Quantization reduces the numerical precision of a model's values, easing the burden on both the compute and the memory of edge devices. There are multiple forms of quantization, but each one sacrifices some precision, enough that accuracy is mostly maintained while the numbers become easier to handle. For example, converting from floating-point to integer datatypes uses significantly less memory, and for some models the resulting loss of precision is negligible. Another example is K-means based weight quantization, which groups similar values in a weight matrix together and represents each group by a centroid. An example is shown below:

[[File:Screenshot_2025-04-24_170604.png|500px|thumb|right|By clustering each index in the matrix and approximating with centroids, the overall computations can be done much more quickly and are more easily handled by edge devices]]

In recent work, Quantized Neural Networks (QNNs) have demonstrated that even extreme quantization, such as using just 1-bit values for weights and activations, can retain near state-of-the-art accuracy across vision and language tasks [12]. This type of quantization drastically reduces memory access requirements and replaces expensive arithmetic operations with fast, low-power bitwise operations such as XNOR and popcount. These benefits are especially important for edge deployment, where energy efficiency is critical. In addition to model compression, Hubara et al. also show that quantized gradients, using as little as 6 bits, can be employed during training with minimal performance loss, further enabling efficient on-device learning [12]. QNNs have achieved strong results even on demanding benchmarks like ImageNet, while offering significant speedups and memory savings, making them one of the most practical solutions for edge AI deployment [12].
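The sketch below illustrates the K-means based weight quantization idea described above: weights are clustered into a small codebook of centroids, and each weight is then stored as a small integer index into that codebook. The function names, the pure-NumPy clustering loop, and the codebook size are illustrative assumptions, not an implementation taken from the cited work.

<syntaxhighlight lang="python">
import numpy as np

def kmeans_quantize(weights: np.ndarray, n_clusters: int = 16, n_iter: int = 20):
    """Cluster a weight matrix into a small codebook of centroids (illustrative sketch).

    Returns (codebook, indices): each weight is replaced by an index into the
    codebook, so the matrix can be stored as small integers plus a few floats.
    """
    flat = weights.reshape(-1)
    # Initialise centroids evenly across the observed weight range.
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook, idx.astype(np.uint8).reshape(weights.shape)

def dequantize(codebook: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate weight matrix from the codebook and indices."""
    return codebook[indices]

# Example: a 4x4 weight matrix compressed to a 4-entry codebook (2-bit indices).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
codebook, idx = kmeans_quantize(w, n_clusters=4)
print("max reconstruction error:", np.max(np.abs(w - dequantize(codebook, idx))))
</syntaxhighlight>

Storing 8-bit indices plus a handful of centroid values in place of 32-bit floats is what yields the memory savings discussed above; the reconstruction error shrinks as the codebook grows.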
'''Pruning:''' Pruning is an optimization technique that systematically removes low-salience parameters, such as weakly contributing weights or redundant hypothesis paths, from a machine learning model or decoding algorithm in order to reduce computational overhead. In the context of edge computing, where resources such as memory bandwidth, power, and processing time are limited, pruning enables the deployment of performant models within strict efficiency constraints.

In statistical machine translation (SMT) systems, pruning is particularly critical during the decoding phase, where the search space of possible translations grows exponentially with sentence length. Techniques such as histogram pruning and threshold pruning are employed to manage this complexity. Histogram pruning restricts the number of candidate hypotheses retained in a decoding stack to a fixed maximum size and discards the remainder. Threshold pruning eliminates hypotheses whose scores fall below a set proportion of the best-scoring candidate, filtering out weak candidates early. The paper by Banik et al. introduces a machine learning based dynamic pruning framework that adaptively tunes the pruning parameters, namely stack size and beam threshold, based on structural features of the input text such as sentence length, syntactic complexity, and the distribution of stop words. Rather than relying on static hyperparameters, this method uses a classifier (the CN2 algorithm) trained on performance data to predict optimal pruning configurations at runtime. Experimental results showed consistent reductions in decoding latency (up to 90%) while maintaining or improving translation quality, as measured by BLEU scores [13].

This adaptive pruning paradigm is highly relevant to edge inference pipelines, where models must balance latency against predictive accuracy. By limiting the hypothesis space and focusing computational resources on high-probability paths, pruning supports real-time, resource-efficient processing in edge NLP and embedded translation systems. A short code sketch of histogram and threshold pruning follows the figure below.

[[File:Pruning.png|350px|thumb|center|This shows how pruning can significantly reduce the overall network, leading to better computational and memory management]]
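The sketch below applies static histogram and threshold pruning, as defined above, to a toy stack of scored hypotheses. It assumes scores are non-negative with higher being better, and the default parameter values are illustrative; it does not reproduce the adaptive CN2-based framework of Banik et al., which chooses these parameters at runtime.

<syntaxhighlight lang="python">
def prune_hypotheses(hypotheses, stack_size=100, threshold=0.5):
    """Histogram + threshold pruning over a list of (score, hypothesis) pairs.

    Assumes scores are non-negative and higher is better (e.g. probabilities).
    Histogram pruning keeps at most `stack_size` candidates; threshold pruning
    then drops anything scoring below `threshold` times the best score.
    """
    if not hypotheses:
        return []
    # Histogram pruning: keep only the top `stack_size` candidates.
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)[:stack_size]
    # Threshold pruning: discard candidates far below the best surviving score.
    best_score = ranked[0][0]
    return [h for h in ranked if h[0] >= threshold * best_score]


# Example: five partial translations with toy scores.
stack = [(0.91, "the cat sat"), (0.88, "a cat sat"), (0.40, "cat the sat"),
         (0.35, "the sat cat"), (0.05, "sat cat the")]
print(prune_hypotheses(stack, stack_size=4, threshold=0.5))
# Keeps only the two high-scoring hypotheses; the rest are pruned away.
</syntaxhighlight>

The same principle carries over to weight pruning in neural networks, where low-magnitude weights are zeroed out and the model is briefly fine-tuned to recover accuracy.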
'''Distillation:''' Distillation is a key strategy for reducing model complexity in edge computing environments. Instead of training a compact student model on hard labels, that is, discrete class labels like 0, 1, or 2, the student is trained on the soft outputs of a larger teacher model. These soft labels represent probability distributions over all classes, offering more nuanced supervision. For instance, rather than telling the student that an input belongs strictly to class 3, a teacher might output "70% class 3, 25% class 2, 5% class 1." This richer feedback helps the student model capture subtle relationships between classes that hard labels miss.

Beyond reducing computational demands, distillation enhances generalization by conveying more informative training signals. It also benefits from favorable data geometry, when class distributions are well separated and aligned, and exhibits strong monotonicity, meaning the student model reliably improves as more data becomes available [11]. These properties make it exceptionally well suited to edge devices, where training data may be limited but efficient inference is crucial.

In most cases, knowledge distillation in edge environments involves a large, high-capacity model trained in the cloud acting as the teacher, while the smaller, lightweight student model is deployed on edge devices. A less common but emerging practice is edge-to-edge distillation, where a more powerful edge node or edge server functions as the teacher for other nearby edge devices. This setup is especially valuable in federated, collaborative, or hierarchical edge networks, where cloud connectivity may be limited or privacy concerns necessitate local training. Distillation can also be combined with techniques such as quantization or pruning to further optimize model performance under hardware constraints. An example is shown below, and a minimal sketch of the distillation loss follows the comparison table.

[[File:Knowledge_Distillation.png|700px|thumb|center|This shows how a complex teacher model transfers learned knowledge to a smaller student model using soft predictions to enable efficient edge deployment]]

{| class="wikitable" style="width:100%; text-align:left;"
|+ '''Comparison of Model Compression Techniques for Edge Deployment'''
! Technique
! Description
! Primary Benefit
! Trade-offs
! Ideal Use Case
|-
| '''Pruning'''
| Removes unnecessary weights or neurons from a neural network.
| Reduces model size and computation.
| May require retraining or fine-tuning to preserve accuracy.
| Useful for deploying models on devices with strict memory and compute constraints.
|-
| '''Quantization'''
| Converts high-precision values (e.g., 32-bit float) to lower precision (e.g., 8-bit integer or binary).
| Lowers memory usage and accelerates inference.
| Risk of precision loss, especially in very small or sensitive models.
| Ideal when real-time inference and power efficiency are essential.
|-
| '''Distillation'''
| Trains a smaller model (student) using the output probabilities of a larger, more complex teacher model.
| Preserves performance while reducing model complexity.
| Requires access to a trained teacher model and additional training data.
| Effective when deploying accurate, lightweight models under data or resource constraints.
|}
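The sketch below shows a common form of the distillation loss described above, written with PyTorch (a framework choice assumed here, not prescribed by the cited works). The student is trained on a weighted mix of ordinary hard-label cross-entropy and the KL divergence to the teacher's temperature-softened probabilities; the temperature and weighting values are illustrative defaults.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Weighted sum of soft-label (teacher) and hard-label losses.

    `temperature` softens both distributions so the student sees the teacher's
    relative class probabilities; `alpha` balances soft vs. hard supervision.
    Both defaults are illustrative, not taken from the cited papers.
    """
    # Soft targets: KL divergence between softened student and teacher outputs.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard scaling to keep gradient magnitudes comparable
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random logits for a batch of 8 samples and 5 classes.
student_logits = torch.randn(8, 5, requires_grad=True)
teacher_logits = torch.randn(8, 5)   # in practice, produced by the frozen teacher
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
</syntaxhighlight>

In an edge deployment, the teacher's logits would typically be computed once in the cloud (or on a stronger edge node) and shipped alongside the training data, so that only the small student needs to run on the device.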