==='''Model Compression Techniques'''===
Despite the challenges of ML at the edge, a variety of methods can make training more efficient and bring heavy workloads within reach of even the limited computing power of certain edge devices.

'''Quantization:''' Quantization reduces the numerical precision of a model's values, easing the burden on both the compute and the memory of edge devices. There are multiple forms of quantization, but each one sacrifices some precision, enough that accuracy is mostly maintained while the numbers become easier to handle. For example, converting from floating-point to integer datatypes uses significantly less memory, and for some models the resulting loss of precision is negligible. Another example is K-means based weight quantization, which groups similar values in a weight matrix together and represents each group by a centroid. An example is shown below:

[[File:Screenshot_2025-04-24_170604.png|500px|thumb|right|By clustering each index in the matrix and approximating with centroids, the overall computations can be done much more quickly and are more easily handled by edge devices]]

In recent work, Quantized Neural Networks (QNNs) have demonstrated that even extreme quantization, such as using just 1-bit values for weights and activations, can retain near state-of-the-art accuracy across vision and language tasks [12]. This type of quantization drastically reduces memory access requirements and replaces expensive arithmetic operations with fast, low-power bitwise operations such as XNOR and popcount. These benefits are especially important for edge deployment, where energy efficiency is critical. In addition to model compression, Hubara et al. also show that quantized gradients, using as little as 6 bits, can be employed during training with minimal performance loss, further enabling efficient on-device learning [12]. QNNs have achieved strong results even on demanding benchmarks like ImageNet, while offering significant speedups and memory savings, making them one of the most practical solutions for edge AI deployment [12].
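The sketch below illustrates the K-means based weight quantization idea described above: weights are clustered into a small codebook of centroids, and each weight is then stored as a small integer index into that codebook. The function names, the pure-NumPy clustering loop, and the codebook size are illustrative assumptions, not an implementation taken from the cited work.

<syntaxhighlight lang="python">
import numpy as np

def kmeans_quantize(weights: np.ndarray, n_clusters: int = 16, n_iter: int = 20):
    """Cluster a weight matrix into a small codebook of centroids (illustrative sketch).

    Returns (codebook, indices): each weight is replaced by an index into the
    codebook, so the matrix can be stored as small integers plus a few floats.
    """
    flat = weights.reshape(-1)
    # Initialise centroids evenly across the observed weight range.
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook, idx.astype(np.uint8).reshape(weights.shape)

def dequantize(codebook: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate weight matrix from the codebook and indices."""
    return codebook[indices]

# Example: a 4x4 weight matrix compressed to a 4-entry codebook (2-bit indices).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
codebook, idx = kmeans_quantize(w, n_clusters=4)
print("max reconstruction error:", np.max(np.abs(w - dequantize(codebook, idx))))
</syntaxhighlight>

Storing 8-bit indices plus a handful of centroid values in place of 32-bit floats is what yields the memory savings discussed above; the reconstruction error shrinks as the codebook grows.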
'''Pruning:''' Pruning is an optimization technique that systematically removes low-salience parameters, such as weakly contributing weights or redundant hypothesis paths, from a machine learning model or decoding algorithm in order to reduce computational overhead. In the context of edge computing, where resources such as memory bandwidth, power, and processing time are limited, pruning enables the deployment of performant models within strict efficiency constraints.

In statistical machine translation (SMT) systems, pruning is particularly critical during the decoding phase, where the search space of possible translations grows exponentially with sentence length. Techniques such as histogram pruning and threshold pruning are employed to manage this complexity. Histogram pruning restricts the number of candidate hypotheses retained in a decoding stack to a fixed maximum size and discards the remainder. Threshold pruning eliminates hypotheses whose scores fall below a set proportion of the best-scoring candidate, filtering out weak candidates early. The paper by Banik et al. introduces a machine learning based dynamic pruning framework that adaptively tunes the pruning parameters, namely stack size and beam threshold, based on structural features of the input text such as sentence length, syntactic complexity, and the distribution of stop words. Rather than relying on static hyperparameters, this method uses a classifier (the CN2 algorithm) trained on performance data to predict optimal pruning configurations at runtime. Experimental results showed consistent reductions in decoding latency (up to 90%) while maintaining or improving translation quality, as measured by BLEU scores [13].

This adaptive pruning paradigm is highly relevant to edge inference pipelines, where models must balance latency against predictive accuracy. By limiting the hypothesis space and focusing computational resources on high-probability paths, pruning supports real-time, resource-efficient processing in edge NLP and embedded translation systems. A short code sketch of histogram and threshold pruning follows the figure below.

[[File:Pruning.png|350px|thumb|center|This shows how pruning can significantly reduce the overall network, leading to better computational and memory management]]
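The sketch below applies static histogram and threshold pruning, as defined above, to a toy stack of scored hypotheses. It assumes scores are non-negative with higher being better, and the default parameter values are illustrative; it does not reproduce the adaptive CN2-based framework of Banik et al., which chooses these parameters at runtime.

<syntaxhighlight lang="python">
def prune_hypotheses(hypotheses, stack_size=100, threshold=0.5):
    """Histogram + threshold pruning over a list of (score, hypothesis) pairs.

    Assumes scores are non-negative and higher is better (e.g. probabilities).
    Histogram pruning keeps at most `stack_size` candidates; threshold pruning
    then drops anything scoring below `threshold` times the best score.
    """
    if not hypotheses:
        return []
    # Histogram pruning: keep only the top `stack_size` candidates.
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)[:stack_size]
    # Threshold pruning: discard candidates far below the best surviving score.
    best_score = ranked[0][0]
    return [h for h in ranked if h[0] >= threshold * best_score]


# Example: five partial translations with toy scores.
stack = [(0.91, "the cat sat"), (0.88, "a cat sat"), (0.40, "cat the sat"),
         (0.35, "the sat cat"), (0.05, "sat cat the")]
print(prune_hypotheses(stack, stack_size=4, threshold=0.5))
# Keeps only the two high-scoring hypotheses; the rest are pruned away.
</syntaxhighlight>

The same principle carries over to weight pruning in neural networks, where low-magnitude weights are zeroed out and the model is briefly fine-tuned to recover accuracy.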
'''Distillation:''' Distillation is a key strategy for reducing model complexity in edge computing environments. Instead of training a compact student model on hard labels, that is, discrete class labels like 0, 1, or 2, the student is trained on the soft outputs of a larger teacher model. These soft labels represent probability distributions over all classes, offering more nuanced supervision. For instance, rather than telling the student that an input belongs strictly to class 3, a teacher might output "70% class 3, 25% class 2, 5% class 1." This richer feedback helps the student model capture subtle relationships between classes that hard labels miss.

Beyond reducing computational demands, distillation enhances generalization by conveying more informative training signals. It also benefits from favorable data geometry, when class distributions are well separated and aligned, and exhibits strong monotonicity, meaning the student model reliably improves as more data becomes available [11]. These properties make it exceptionally well suited to edge devices, where training data may be limited but efficient inference is crucial.

In most cases, knowledge distillation in edge environments involves a large, high-capacity model trained in the cloud acting as the teacher, while the smaller, lightweight student model is deployed on edge devices. A less common but emerging practice is edge-to-edge distillation, where a more powerful edge node or edge server functions as the teacher for other nearby edge devices. This setup is especially valuable in federated, collaborative, or hierarchical edge networks, where cloud connectivity may be limited or privacy concerns necessitate local training. Distillation can also be combined with techniques such as quantization or pruning to further optimize model performance under hardware constraints. An example is shown below, and a minimal sketch of the distillation loss follows the comparison table.

[[File:Knowledge_Distillation.png|700px|thumb|center|This shows how a complex teacher model transfers learned knowledge to a smaller student model using soft predictions to enable efficient edge deployment]]

{| class="wikitable" style="width:100%; text-align:left;"
|+ '''Comparison of Model Compression Techniques for Edge Deployment'''
! Technique
! Description
! Primary Benefit
! Trade-offs
! Ideal Use Case
|-
| '''Pruning'''
| Removes unnecessary weights or neurons from a neural network.
| Reduces model size and computation.
| May require retraining or fine-tuning to preserve accuracy.
| Useful for deploying models on devices with strict memory and compute constraints.
|-
| '''Quantization'''
| Converts high-precision values (e.g., 32-bit float) to lower precision (e.g., 8-bit integer or binary).
| Lowers memory usage and accelerates inference.
| Risk of precision loss, especially in very small or sensitive models.
| Ideal when real-time inference and power efficiency are essential.
|-
| '''Distillation'''
| Trains a smaller model (student) using the output probabilities of a larger, more complex teacher model.
| Preserves performance while reducing model complexity.
| Requires access to a trained teacher model and additional training data.
| Effective when deploying accurate, lightweight models under data or resource constraints.
|}
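The sketch below shows a common form of the distillation loss described above, written with PyTorch (a framework choice assumed here, not prescribed by the cited works). The student is trained on a weighted mix of ordinary hard-label cross-entropy and the KL divergence to the teacher's temperature-softened probabilities; the temperature and weighting values are illustrative defaults.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Weighted sum of soft-label (teacher) and hard-label losses.

    `temperature` softens both distributions so the student sees the teacher's
    relative class probabilities; `alpha` balances soft vs. hard supervision.
    Both defaults are illustrative, not taken from the cited papers.
    """
    # Soft targets: KL divergence between softened student and teacher outputs.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard scaling to keep gradient magnitudes comparable
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random logits for a batch of 8 samples and 5 classes.
student_logits = torch.randn(8, 5, requires_grad=True)
teacher_logits = torch.randn(8, 5)   # in practice, produced by the frozen teacher
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
</syntaxhighlight>

In an edge deployment, the teacher's logits would typically be computed once in the cloud (or on a stronger edge node) and shipped alongside the training data, so that only the small student needs to run on the device.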