Phi Silica task specialization using LoRA in Microsoft Learning Zone: A technical deep dive
At Build 2025, we announced support for LoRA (low-rank adaptation) fine-tuning for Phi Silica – our inbox Small Language Model (SLM) that runs locally on Copilot+ PCs. LoRA makes fine-tuning more efficient by updating only a small subset of the model’s parameters with custom data. This improves performance on desired tasks without affecting the model’s overall abilities.
This post shares the behind-the-scenes work and design considerations that enabled us to customize generation for a real-world use case: generating high-quality, pedagogically valuable Kahoot! quizzes. Our efforts led to a 75% reduction in rejection rates and a 4.6X uplift in subjective quality scores¹.
Microsoft Learning Zone: Generating Kahoot! games on-device
Earlier this year, we introduced Microsoft Learning Zone (under the code name “Project Spark”), Microsoft’s first learning companion app designed specifically for Copilot+ PCs. It empowers educators to effortlessly create interactive and personalized lessons using on-device AI – at no cost.
As part of this initiative, we partnered with Kahoot!, the beloved learning platform, to enable the creation of engaging classroom games powered entirely by Phi Silica.
Microsoft Learning Zone supports a wide range of generation tasks with varying pedagogical requirements – from dynamic introductions and multiple-choice formats to customizable refinement flows. Naturally, training and distributing a custom fine-tuned model for each generation task would be inefficient and impractical. Instead, we leveraged LoRA adapters to specialize a single base Phi Silica model to diverse task needs with minimal overhead.
Defining quality: verifiable vs. subjective
Kahoot! quizzes consist of multiple-choice questions, and evaluating their quality combines structural requirements with human judgement. We defined two axes of quality:
Verifiable quality
This includes clearly defined output format constraints – like maximum character lengths for questions and answers – aligned with Kahoot!’s UX across devices. These are enforced as hardcoded guardrails in Microsoft Learning Zone’s real-time generation pipeline. Generated Kahoot! multiple-choice questions are streamed to the user for review only if they successfully pass through these guardrails. Reducing the rejection rate due to guardrail violations directly improves user-perceived latency, as discarded generations increase the delay until a user can review a newly generated question.
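To make these checks concrete, here is a minimal sketch of such a format guardrail in Python; the character limits, the four-answer assumption and the data structure are illustrative placeholders rather than Learning Zone’s actual implementation.

```python
# Illustrative format guardrail; limits and field names are hypothetical placeholders.
from dataclasses import dataclass

MAX_QUESTION_CHARS = 120   # hypothetical question length limit
MAX_ANSWER_CHARS = 75      # hypothetical answer length limit

@dataclass
class QuizQuestion:
    question: str
    answers: list[str]      # one correct answer plus distractors

def passes_guardrails(q: QuizQuestion) -> bool:
    """Return True only if the generated question meets the hard format constraints."""
    if not q.question or len(q.question) > MAX_QUESTION_CHARS:
        return False
    if len(q.answers) != 4:  # assume a four-option multiple-choice format
        return False
    return all(0 < len(a) <= MAX_ANSWER_CHARS for a in q.answers)
```

Only questions that pass a check like this are streamed to the user; anything else is discarded and regenerated, which is why the rejection rate matters for latency.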
Subjective quality
Subjective quality addresses attributes like engagement, clarity and educational relevance, which are not easy to measure quantitatively but are important for user perception and satisfaction. In collaboration with the Kahoot! team, we defined a rubric and guidelines for human annotators and then used those insights to scale human evaluation via a novel agentic framework – more on that below.
Dataset curation and distillation
To enable effective LoRA fine-tuning, we curated a high-quality dataset grounded in real-world Microsoft Learning Zone use, as described below. The challenge was to provide Phi Silica with educationally rich, diverse content it could learn from to adapt the model’s behavior while fine-tuning only ~1% of its parameters with LoRA adapters.
Instead of relying solely on base model outputs, we adopted a distillation approach, using a leading LLM as a teacher. We applied it to generate synthetic Kahoot!-style Q&A tuples from curated learning materials. This approach allowed us to bootstrap a training dataset with higher initial quality and coverage.
The Microsoft Learning Zone pipeline starts by ingesting curated learning materials and extracting key facts and segments. Each segment is processed independently to keep reasoning focused and to stay within model context-length constraints. For each extracted segment-key fact pair, we prompted GPT-4o to generate a single Kahoot!-style question centered on the fact. We passed each generated question through Microsoft Learning Zone’s guardrails, which validate hard constraints such as character length limits on questions and answers – aligned with Kahoot!’s UI guidelines.
In total, we generated approximately 13,000 synthetic examples, which were then split into 10,000 training and 3,000 testing examples. Curating datasets is one of the most overlooked yet important pieces of the puzzle in making AI systems work effectively for the scenario at hand.
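For illustration, the distillation loop can be sketched as below; the helper callables (`extract_segments_and_facts`, `teacher_generate`, `passes_guardrails`) and the prompt/completion layout are assumptions, and the production pipeline’s GPT-4o prompting and data handling differ.

```python
# Sketch of the distillation loop; the helper callables and record layout are assumptions.
import json
import random

def build_dataset(learning_materials, extract_segments_and_facts, teacher_generate, passes_guardrails):
    """teacher_generate(segment, fact) -> candidate question dict (e.g. from a GPT-4o call)."""
    examples = []
    for doc in learning_materials:
        for segment, key_fact in extract_segments_and_facts(doc):
            candidate = teacher_generate(segment, key_fact)
            if passes_guardrails(candidate):          # reuse the hard-constraint checks
                examples.append({
                    "prompt": f"Fact: {key_fact}\nContext: {segment}",
                    "completion": json.dumps(candidate),
                })
    random.shuffle(examples)
    split = int(len(examples) * 10 / 13)              # roughly the 10k train / 3k test split
    return examples[:split], examples[split:]
```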
LoRA fine-tuning with AI Toolkit
Using the recently released Phi Silica LoRA fine-tuning feature in AI Toolkit, we trained LoRA adapters against the (quantized) Phi Silica model. The resulting adapters run locally to customize Phi Silica’s output and match the requirements of our Kahoot! feature.
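AI Toolkit drives the actual training flow, so no code is needed there; purely to illustrate what a LoRA adapter configuration involves, the sketch below uses the Hugging Face PEFT library with a stand-in model name and illustrative rank and target modules – it is not the AI Toolkit workflow or the Phi Silica training stack.

```python
# Conceptual illustration of LoRA adapter setup with Hugging Face PEFT.
# Model name, target modules and hyperparameters are illustrative placeholders,
# not the AI Toolkit / Phi Silica pipeline.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")  # stand-in model
lora_config = LoraConfig(
    r=16,                                     # low adapter rank keeps trainable weights to ~1%
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],    # which projections receive adapters (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()            # confirms only a small fraction of weights train
```

The key point is that only the small adapter matrices are updated during training; the base model stays frozen and can be shared across tasks.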
System prompt considerations
A system prompt grounds the model with context, instructions or other information relevant to a specific use case. It can define the following in a model’s response:
- What persona should be used
- What the model should and shouldn’t answer
- The format of the model’s response
Though it may seem intuitive to provide more information in a system prompt, doing so costs resources and performance: it consumes tokens in the context window and adds latency to the model’s response. The original system prompt needed to get the desired output format from our Phi Silica model was lengthy – it required specifying the format of the JSON table and providing detailed descriptions of the types of questions and answers we wanted.
After customization of our Phi Silica model, we were able to use a shorter system prompt since those details (format, persona, etc.) had been encoded in the LoRA adapter during training. This proved useful since the output required was very different from the base model’s default output.
For example, we used a short prompt that pushed the base model’s output in the direction we wanted while restricting it to a couple of sentences. The prompt we used for training was:
“You will be given a fact and some additional context. Respond with a relevant question, one correct answer and some incorrect answers.”
With this prompt, the base model gave answers in plain English with questions and answers that were based on the fact and context provided. However, the style of the questions and answers did not match what we wanted, and we needed the output to have a specific JSON format.
When performing inference with the LoRA adapter, we often got responses that had good questions and answers, but sometimes the JSON format was still not exactly what we provided in the training set. The solution was to reinforce that format in the system prompt we used for inference:
“You will be given a fact and some additional context. Respond with a relevant question, one correct answer and some incorrect answers. Reply with a strict JSON string for class {question: string, answers: [{answer: string, correct: bool}], gettyImage: string}, wrapped in ```json tags.”
The combination of the LoRA adapter and this new system prompt gave us the desired output.
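The resulting contract can be sketched as follows: the inference-time system prompt above plus a post-processing step that extracts and validates the JSON block before it reaches the guardrails. The local Phi Silica + LoRA call itself is not shown, and the schema check is deliberately minimal.

```python
# Sketch of post-processing the adapter's output; the on-device Phi Silica + LoRA call
# is not shown, and the schema check is a simplified illustration.
import json
import re

SYSTEM_PROMPT = (
    "You will be given a fact and some additional context. Respond with a relevant question, "
    "one correct answer and some incorrect answers. Reply with a strict JSON string for class "
    "{question: string, answers: [{answer: string, correct: bool}], gettyImage: string}, "
    "wrapped in ```json tags."
)

def parse_question(raw: str) -> dict | None:
    """Extract and validate the ```json block the adapter was trained to emit."""
    match = re.search(r"```json\s*(\{.*\})\s*```", raw, re.DOTALL)
    if not match:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if "question" not in payload or "answers" not in payload:
        return None                           # minimal schema check before the guardrails
    return payload
```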
Hyperparameter selection
In addition to changes to the system prompt, changes to the hyperparameters used during LoRA adapter training may improve output quality.
The default AI Toolkit hyperparameters offer a solid starting point, but adjusting them can optimize results for specific scenarios. We evaluated various settings on a smaller dataset for faster experiments. Training remained stable near the default values, while extreme settings led to failed convergence – evidenced by stagnant loss and poor output – indicating the importance of staying close to the defaults.
To identify which adapters perform best, it is important to include some evaluations during experiments. At this stage of parameter exploration, we used a simplified agent-as-a-judge assessment, as described in the following section.
For our Kahoot! use case, we did not find any parameter combination that worked better than the defaults. However, the experimentation gave us confidence in them.
Once confident in our system prompt and hyperparameters, we froze them and proceeded with longer training runs. Transitioning from exploration to exploitation in LoRA adapter training, we used the full dataset and increased the early-stopping patience, allowing extended training as long as the evaluation metrics kept improving.
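The exploration phase can be pictured with the sketch below; the hyperparameter names and values and the `train_and_score` helper are hypothetical stand-ins for runs that AI Toolkit manages in practice.

```python
# Illustrative exploration loop around default hyperparameters; names, values and the
# train_and_score helper are hypothetical, and AI Toolkit manages the real training runs.
from itertools import product

DEFAULTS = {"learning_rate": 1e-4, "rank": 16, "epochs": 3}
GRID = {"learning_rate": [5e-5, 1e-4, 2e-4], "rank": [8, 16, 32]}

def explore(train_and_score):
    """train_and_score(cfg) -> simplified agent-as-a-judge score on a small dataset."""
    best_cfg, best_score = dict(DEFAULTS), train_and_score(DEFAULTS)
    for lr, rank in product(GRID["learning_rate"], GRID["rank"]):
        cfg = dict(DEFAULTS, learning_rate=lr, rank=rank)
        score = train_and_score(cfg)          # short run, evaluated on the smaller dataset
        if score > best_score:
            best_cfg, best_score = cfg, score
    # Freeze the winner, then retrain on the full dataset with more early-stopping patience.
    return best_cfg
```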
Evaluating model quality
Verifiable quality: guardrail pass rate
The customized system with the Phi Silica + LoRA adapter showed a statistically significant 75% reduction in the rejection rate measured via the guardrails². This directly improved user experience by reducing failed generations and perceived latency.
Subjective quality: agent-as-a-judge evaluation
Human evaluation costs time and resources. To scale subjective assessment, we built a multi-agent evaluation framework using AutoGen. Unlike traditional LLM-as-a-judge approaches, this framework simulates a review team of AI agents engaging in deliberative conversation to deliver nuanced, balanced and multi-perspective assessments. To do this, we instructed the ‘review team’ with the set of quality measures we wanted to evaluate for each question. We use this framework to accelerate preliminary offline quality assessment. After these initial evaluations, we gradually validate the results against multiple layers of human reviewers as part of Microsoft’s responsible product release practices.
We share our base code for further research.
Agent roles
The evaluation framework consists of several personas engaged in a discussion. Below we detail the agents involved and their roles (an illustrative sketch of such a team follows the list):
- The Reviewer agent was tasked with evaluating each quality attribute using a chain-of-thought (CoT) approach, providing an initial justification and then a score for each quality criterion.
- The Critic agent received the same context as the Reviewer agent along with the Reviewer’s evaluation. It was instructed to challenge the Reviewer’s reasoning – either proposing alternative scores or reinforcing agreement with reasoned justification.
This dialogue continues iteratively until a convergence point is reached, at which stage the Meta-Reviewer is invoked.
- The Meta-Reviewer reviews the full conversation between the Reviewer and the Critic, weighs their arguments along with the base context, and issues a final verdict. This final score, whether it aligns with or diverges from the previous agents’ scores, is treated as the final output of the evaluation framework.
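Here is a minimal sketch of how such a Reviewer/Critic/Meta-Reviewer team can be wired up with AutoGen’s v0.2-style group-chat API; the system messages are abbreviated, the `llm_config` is a placeholder, and the fixed round-robin turn order is a simplification of the convergence logic described above.

```python
# Minimal Reviewer/Critic/Meta-Reviewer team with AutoGen (v0.2-style API).
# System messages, llm_config and the rubric string are abbreviated placeholders.
import autogen

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "..."}]}   # placeholder
RUBRIC = "Educational Value, Clarity and Phrasing, Correct Answers Quality, ..."  # see metrics below

reviewer = autogen.AssistantAgent(
    name="Reviewer",
    system_message=f"Evaluate the quiz question on each criterion ({RUBRIC}). "
                   "Give a justification first, then a 1-10 score per criterion.",
    llm_config=llm_config,
)
critic = autogen.AssistantAgent(
    name="Critic",
    system_message="Challenge the Reviewer's reasoning: propose alternative scores "
                   "or reinforce agreement, with justification.",
    llm_config=llm_config,
)
meta_reviewer = autogen.AssistantAgent(
    name="MetaReviewer",
    system_message="Read the full discussion and issue the final scores as the verdict.",
    llm_config=llm_config,
)

groupchat = autogen.GroupChat(
    agents=[reviewer, critic, meta_reviewer],
    messages=[],
    max_round=6,
    speaker_selection_method="round_robin",
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
reviewer.initiate_chat(manager, message="Evaluate this question: <question JSON + source context>")
```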
Quality metrics
The review team was instructed to evaluate a set of metrics for each generated question; here are these metrics exactly as they appear in the system prompt of each agent.
- Educational Value: whether the question teaches or tests a nontrivial concept that is highlighted in the given context.
- Clarity and Phrasing: whether the wording is clear, precise, grammatically correct and understandable without confusion (e.g., no double negatives).
- Correct Answers Quality: whether the correct answers fully answer the question.
- Distractors Quality: whether the distractors (incorrect answers) are reasonable and originate from a similar context, ensuring that someone unfamiliar with the material could be misled. The distractors must be wrong answers to the question.
- Focus: whether the question targets a single, clear idea without mixing unrelated concepts.
- Conciseness: whether the question is concise and to the point, avoiding unnecessary complexity or verbosity (especially since it is presented in a Kahoot! activity).
Agent-as-a-judge evaluation results
Using a scoring framework we crafted with the Kahoot! team, our agents rated each question across key quality metrics on a scale from 1 to 10. Our results show that Phi Silica + LoRA outperforms the base Phi Silica model across all quality attributes, as seen in the graph below.

Figure 1: Average quality scores for Phi Silica vs. Phi Silica + LoRA across six evaluation aspects, with 95% confidence intervals
The results show that the Phi Silica + LoRA model consistently outperforms the baseline Phi Silica model (without customization) across all six quality aspects of question generation, including clarity, correctness and educational value. Notably, the most significant improvements are seen in the quality of correct answers and the quality of distractors (incorrect answers), where Phi Silica + LoRA achieves both higher average scores and narrower confidence intervals. The 95% confidence intervals indicate the statistical reliability of these findings – non-overlapping intervals between models suggest that LoRA’s improvements are not due to chance. Overall, the results highlight a meaningful and statistically robust gain in quality enabled by LoRA fine-tuning.
Overall, we found that our agent-as-a-judge favored the samples generated by the Phi Silica + LoRA model in 22.5% of the cases, compared to 14.5% for those generated by the base model.

Figure 2: Base vs Base + LoRA A/B test overall “win rate” according to our agent-as-a-judge evaluation system, where the base model is Phi Silica
Comparison with human judgement
To assess the effectiveness of using Phi Silica with a LoRA adapter compared to Phi Silica alone, we conducted an A/B test via a human evaluation study. The study included 2,350 paired samples generated by both models, with one human preference collected for each pair.
Annotators were shown pairs of multiple-choice questions generated by the two models from the same input context. For each pair, annotators selected the better question based on the same criteria defined in our agent-as-a-judge evaluation framework. Each annotator received the original context along with both generated questions and their answer choices and was asked to choose the preferred question in a blind A/B test. The results show a clear preference for the Phi Silica + LoRA model, with a strong effect size measured as a 4.6X uplift, as seen in Figure 3.

Figure 3: Human preference in a blind A/B test on questions generated by Phi Silica and Phi Silica + LoRA
Our framework produces a set of scores across the selected quality metrics. To aggregate these into a single score per question, we averaged the metric scores. We then compared the aggregated scores of each question pair to determine a model preference. Finally, we compared these preferences to those of human annotators to assess alignment. The results show that the framework achieves 79.5% accuracy and a 77.3% F1 score in predicting human preference. We note that human preferences may differ in how they weight the various quality metrics. While our evaluation relied on a simple average across all metrics, individuals are likely to prioritize certain aspects of quality over others, leading to potential misalignment.
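For clarity, the aggregation and agreement computation can be sketched as follows, assuming per-metric score dictionaries from the agent framework and preference labels from the human study; the toy data is purely illustrative.

```python
# Sketch of turning per-metric scores into a preference and measuring agreement with
# human annotators; the toy data below is illustrative, not the study's data.
from statistics import mean
from sklearn.metrics import accuracy_score, f1_score

def model_preference(scores_base: dict, scores_lora: dict) -> str:
    """Average the 1-10 per-metric scores and prefer the higher-scoring generation."""
    return "lora" if mean(scores_lora.values()) > mean(scores_base.values()) else "base"

# Each pair holds the judge's scores for the base and LoRA generations of one question.
judge_scores = [
    ({"clarity": 6, "focus": 7}, {"clarity": 8, "focus": 9}),
    ({"clarity": 8, "focus": 8}, {"clarity": 7, "focus": 6}),
]
human_labels = ["lora", "base"]               # blind A/B preferences for the same pairs

predicted = [model_preference(base, lora) for base, lora in judge_scores]
print("accuracy:", accuracy_score(human_labels, predicted))
print("F1:", f1_score(human_labels, predicted, pos_label="lora"))
```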
We share the model’s confusion matrix in Figure 4 below.

Figure 4: Normalized binary confusion matrix on a balanced dataset, comparing Phi Silica before and after adding the LoRA adapter
Summary
The work done by the Microsoft Education team is a real-life example of how LoRA adapters are a cost-effective, lightweight option for customizing Phi Silica for task-specific scenarios like generating Kahoot! quizzes.
Instead of retraining the larger Phi Silica base model, training was done on the smaller LoRA adapter, using a curated dataset grounded in Microsoft Learning Zone’s guardrails (e.g., aligning to Kahoot!’s UI guidelines). Customization via trained LoRA adapters shortened the required system prompt, improving efficiency while maintaining the desired output structure. The default AI Toolkit hyperparameters were validated and then fixed for extended training runs.
Output quality was assessed by agents and verified through human testing for added reliability. In both evaluations, the Kahoot! quizzes generated by the Phi Silica + LoRA model were favored by both the agents and the human reviewers.

Figure 5: LoRA distillation and evaluation flow overview
Ultimately, these efforts led to a 75% reduction in rejection rates and a 4.6X uplift in subjective quality scores for AI-generated Kahoot! quizzes.
Kahoot! game generation through Microsoft Learning Zone will launch in public preview for educators to experiment with later this summer. This work demonstrates how small models, when carefully adapted, can deliver robust, personalized AI experiences – even within constrained environments like on-device learning tools.
Read more about leveraging LoRA with Phi Silica on Copilot+ PC devices at our 2025 Build announcement page.
Acknowledgements:
Mousa Arraf – Applied Science Intern
Ella Ben-Tov – Principal Product Manager
Pashmina Cameron – Principal Applied Science Manager
Henry Jackson-Flux – Senior Applied Scientist
Merav Mofaz – Senior Applied Scientist
Endnotes:
¹ Metrics jointly defined with Kahoot!, data generated by Microsoft as described above, analysis done on May 13, 2025.
² Measurements performed on data generated by Microsoft, analysis done on May 13, 2025.