What is Knowledge Distillation?
“Teaching a fast learner to mimic a genius without the need for a PhD”
In machine learning, knowledge distillation is the technique of passing the wisdom of a large, pre-trained model (such as an LLM) down to a smaller, more efficient model. Traditional deep learning focuses on aligning a model's predictions with the correct answers in a dataset. Distillation, in contrast, trains the small model to replicate the refined outputs of a well-trained LLM.
State-of-the-art LLMs are typically massive, demanding substantial processing power to train and run. Smaller models, while faster and cheaper, typically lack the depth, precision, and breadth of knowledge of their larger counterparts. Large models excel at learning deep structure from data; knowledge distillation provides a means to transfer those insights into a smaller AI model.
What is under the hood?
Knowledge distillation works across many types of neural networks. The larger teacher LLM and the smaller student model it produces do not have to share the same deep learning architecture, as the sketch below illustrates.
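As a minimal, hypothetical sketch in PyTorch: the only thing the teacher and student must agree on is the shape of their outputs, not their internal architecture. The dimensions and module choices here are illustrative, not taken from any particular system.

```python
import torch.nn as nn

NUM_CLASSES = 10  # hypothetical output space shared by teacher and student

# A (relatively) large transformer-based teacher: maps a 512-dimensional
# input representation to logits over NUM_CLASSES.
teacher = nn.Sequential(
    nn.Linear(512, 512),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6,
    ),
    nn.Linear(512, NUM_CLASSES),
)

# A much smaller student with a completely different architecture
# (a plain feed-forward network), producing logits over the same classes.
student = nn.Sequential(
    nn.Linear(512, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_CLASSES),
)
```

Because both models emit logits over the same output space, the student can be trained to imitate the teacher's outputs even though the two share no internal structure.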
In machine learning, a trained model's "knowledge" is usually represented by the parameters it has learned, specifically the weights and biases applied throughout the mathematical representation of the neurons in the network. Knowledge distillation, in contrast, views a model's knowledge not as the exact parameters it acquires during training, but as its ability to generalize to new data after that training.
A loss function in machine learning and deep learning is a mathematical function that measures the difference, or "error", between a model's predicted outputs and the correct target values. The model parameters (see above) are tweaked to minimize this loss; by minimizing it, the model becomes better at generalizing from the training data to make accurate predictions on unseen data. Knowledge distillation uses a distillation loss instead. This special type of loss function measures how closely the student's outputs match the teacher's, allowing a smaller model, with far fewer parameters, to approach the performance of the larger teacher model (see the sketch below).
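A minimal sketch of such a distillation loss, following the common recipe of softened teacher targets combined with the ordinary hard-label loss. The temperature and alpha values are hypothetical hyperparameters, not prescribed by any specific system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # "Soft" loss: match the student's softened output distribution to the
    # teacher's softened distribution via KL divergence. Scaling by T^2
    # keeps gradient magnitudes comparable as the temperature changes.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * (temperature ** 2)

    # "Hard" loss: the usual cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Blend the two; alpha controls how much weight the teacher's
    # soft targets receive relative to the ground truth.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a training loop, the teacher's logits would typically be computed under torch.no_grad() so that only the student's parameters are updated; the temperature softens both distributions so the student can learn from the teacher's relative confidence across classes, not just its top answer.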
Do we need knowledge distillation?
Knowledge distillation balances the trade-off between performance and efficiency, implementation cost versus accuracy. A model built with knowledge distillation can achieve a level of understanding and accuracy often comparable to the original LLM, but with significantly lower computational demands. In real-world applications, such as LLMs on mobile phones and low-latency AI systems, knowledge distillation provides a way to scale down powerful language models while maintaining high performance, balancing the power of large models against the practical limitations of deployment.
In the news
DeepSeek, an AI model that recently emerged (Jan 27th, 2025), has surged to the top of Apple's App Store, surpassing ChatGPT! Its developers claim it cost $6 million to train. In comparison, it reportedly took "over $100 million" (a figure cited by OpenAI CEO Sam Altman) to train GPT-4. Industry observers believe DeepSeek may have employed knowledge distillation to replicate the performance of ChatGPT.