Instruction Tuning

Instruction tuning fine-tunes a pretrained language model on examples of natural language instructions and their desired responses. It transforms a base model (good at text completion) into an assistant model (good at following instructions).

The Problem Instruction Tuning Solves

A pretrained base LLM learns to continue text. Given “What is the capital of France?”, it might continue with another question rather than an answer. It does not inherently know it should act as a helpful assistant.

Instruction tuning teaches the model to recognize and follow instructions by training on diverse (instruction, response) pairs.

FLAN (Fine-tuned Language Net)

Wei et al. (2022). Fine-tune a 137B-parameter pretrained language model (LaMDA-PT) on a mixture of NLP tasks framed as natural language instructions.

Key finding: instruction tuning on 60+ tasks improved zero-shot performance on held-out tasks significantly. Generalization to unseen tasks improved with the number and diversity of training tasks.

Instruction templates: each task is expressed in multiple natural language templates:

Translate the following English sentence to French: {sentence}
What is {sentence} in French?
How would you say {sentence} in French?

Multiple templates per task increase diversity and generalization.
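The template idea can be sketched in a few lines of Python. This is a minimal illustration, not FLAN's actual data pipeline: the function names `render_examples` and `sample_example` are hypothetical, and the templates are the three shown above.

```python
import random

# Templates copied from the examples above; each phrases the same task differently.
TEMPLATES = [
    "Translate the following English sentence to French: {sentence}",
    "What is {sentence} in French?",
    "How would you say {sentence} in French?",
]

def render_examples(sentence, target, templates=TEMPLATES):
    """Return one (instruction, response) pair per template."""
    return [(t.format(sentence=sentence), target) for t in templates]

def sample_example(sentence, target, rng, templates=TEMPLATES):
    """Pick a single template at random, as a data loader might do per example."""
    template = rng.choice(templates)
    return template.format(sentence=sentence), target

# One source example becomes three differently worded training pairs.
pairs = render_examples("Good morning", "Bonjour")
for instruction, response in pairs:
    print(instruction, "->", response)
```

Whether to expand every example through all templates or to sample one template per example is a design choice; sampling keeps the dataset size fixed while still exposing the model to varied phrasings.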

FLAN-T5 / FLAN-PaLM: Chung et al. (2022) extended FLAN to more tasks (1,800+) and to other model families, including T5 and PaLM (up to 540B parameters). FLAN-T5 is widely used as a strong open-source instruction-following baseline.

InstructGPT (OpenAI 2022)

Three-stage process that became the blueprint for modern assistant models:

  1. Supervised Fine-Tuning (SFT): collect a dataset of (prompt, ideal response) pairs written by human contractors. Fine-tune GPT-3 on this data.

  2. Reward Model (RM): collect human preference comparisons (which of two responses is better?). Train a model that assigns each response a scalar score, so that preferred responses score higher. See RLHF.

  3. RL fine-tuning (PPO): optimize the SFT model with PPO to maximize RM scores, with a KL penalty to prevent divergence from SFT.
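Stages 2 and 3 can be made concrete with a small numeric sketch. It assumes a commonly used formulation: a pairwise logistic loss for the reward model, and a per-sample reward equal to the RM score minus $\beta$ times the log-probability ratio between the current policy and the SFT model. The function names and the value of `beta` are illustrative assumptions, not values taken from the text above.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise loss -log(sigmoid(s_preferred - s_rejected)) for one comparison."""
    diff = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def kl_penalized_reward(rm_score, logprob_policy, logprob_sft, beta=0.1):
    """RM score minus beta * log(pi_policy(y|x) / pi_sft(y|x)) for a sampled response.

    beta is an illustrative placeholder; the log-ratio is a single-sample
    estimate of the KL term that keeps the policy close to the SFT model.
    """
    return rm_score - beta * (logprob_policy - logprob_sft)

# The loss is small when the RM already ranks the preferred response higher.
print(preference_loss(2.0, 0.0), preference_loss(0.0, 2.0))

# A response the policy has made more likely than the SFT model did is penalized.
print(kl_penalized_reward(rm_score=1.0, logprob_policy=-4.0, logprob_sft=-5.0))
```

The penalty term is what stops the policy from drifting toward outputs that exploit the reward model but no longer resemble the SFT model's behavior.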

Key finding: human labelers preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the 175B-parameter GPT-3, despite InstructGPT having more than 100$\times$ fewer parameters.

Self-Instruct

Wang et al. (2022). Generate instruction-following training data using the LLM itself.

  1. Start with 175 manually written seed tasks.
  2. Prompt the LLM to generate new tasks.
  3. Filter out low-quality and duplicate instructions.
  4. Prompt the LLM to generate responses for the new instructions.
  5. Fine-tune on the synthetic (instruction, response) pairs.
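Steps 2 to 4 of the loop above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: `propose` and `respond` are hypothetical callables standing in for LLM calls, and the character-level similarity from `difflib` with a 0.7 threshold is a stand-in for whatever duplicate filter a real pipeline uses.

```python
from difflib import SequenceMatcher

def is_near_duplicate(candidate, pool, threshold=0.7):
    """True if candidate is too similar to any instruction already in the pool."""
    return any(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() >= threshold
        for existing in pool
    )

def self_instruct_round(seed_tasks, propose, respond, min_words=3):
    """One round: propose new instructions, filter them, then generate responses."""
    pool = list(seed_tasks)
    new_pairs = []
    for instruction in propose(pool):                 # step 2: generate new tasks
        instruction = instruction.strip()
        if len(instruction.split()) < min_words:      # step 3: crude quality filter
            continue
        if is_near_duplicate(instruction, pool):      # step 3: duplicate filter
            continue
        pool.append(instruction)
        new_pairs.append((instruction, respond(instruction)))  # step 4
    return new_pairs

# Stub callables standing in for real LLM calls (assumptions for illustration).
def fake_propose(pool):
    return [
        "Write a haiku about autumn rain.",
        "Summarize this paragraph in one sentence.",  # duplicate of a seed task
        "Hi",                                         # too short
        "Write a haiku about autumn rain!",           # near-duplicate of the first
    ]

def fake_respond(instruction):
    return f"[response to: {instruction}]"

seeds = ["Summarize this paragraph in one sentence."]
pairs = self_instruct_round(seeds, fake_propose, fake_respond)
print(pairs)
```

In a real run the surviving pairs would be added to the task pool and the round repeated until enough data has been collected for step 5.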

Alpaca (Stanford): fine-tuned LLaMA-7B on 52k instructions generated by GPT-3.5 (text-davinci-003) using a Self-Instruct-style pipeline. Surprisingly effective given the small model and dataset.

Open-Source Instruction Datasets

Dataset            Size     Source
Alpaca             52k      GPT-3.5 generated
Dolly 15k          15k      Human-written
OpenAssistant      161k     Human conversations
FLAN Collection    1.8M+    Task templates
ShareGPT           varies   Real ChatGPT conversations
UltraChat          1.5M     Multi-turn synthetic
OpenHermes         900k     Curated synthetic
SlimOrca           500k     Filtered FLAN + GPT-4

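Data quality matters more than quantity. LIMA (Zhou et al. 2023) reports that fine-tuning on only 1,000 carefully curated examples achieves results competitive with models trained on much larger datasets. At the other end of the scale, Llama 3's instruction fine-tuning reportedly used over 10 million human-annotated examples, again with a strong emphasis on curation.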

Chat Templates

Instruction-tuned models use specific conversation templates that format the multi-turn dialogue.

Llama-3 template:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

The capital of France is Paris.<|eot_id|>

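The model is trained to generate only the assistant turn given the system and user turns: the loss is typically computed on assistant tokens only, while system and user tokens serve as context. Special tokens mark turn boundaries.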
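A minimal sketch of how a conversation could be rendered into this template, and of which parts would receive a training loss. It assumes the line breaks between turns in the listing above are display formatting only, and it marks supervision per text segment rather than per token; `build_training_segments` is a hypothetical helper, not part of any library.

```python
def build_training_segments(messages):
    """Return (text, train_on) segments for a conversation.

    Only the assistant's content and its closing <|eot_id|> are marked for
    loss; headers and system/user content are context only.
    """
    segments = [("<|begin_of_text|>", False)]
    for msg in messages:
        header = f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        body = f"{msg['content']}<|eot_id|>"
        segments.append((header, False))
        segments.append((body, msg["role"] == "assistant"))
    return segments

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

segments = build_training_segments(messages)
full_text = "".join(text for text, _ in segments)
supervised = "".join(text for text, train_on in segments if train_on)

print(full_text)
print("Supervised span:", supervised)
```

Including the end-of-turn token in the supervised span is what teaches the model to stop after its answer. In practice the template shipped with the model's tokenizer should be used instead of a hand-written one, and the mask is applied per token after tokenization.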

Task Diversity and Generalization

A more diverse mix of instruction types generally yields better generalization. Training tasks should include:

  • Classification, NER, NLI, summarization (NLP fundamentals).
  • Open-ended generation (creative writing, explanation).
  • Instruction following (format specification, length control).
  • Reasoning (math, logic, code).
  • Dialogue (multi-turn, roleplay).
  • Safety (refusals for harmful content).

Instruction Tuning vs. Full Fine-tuning

Instruction tuning changes behavior, not knowledge. The base model’s parametric knowledge is largely unchanged; instruction tuning teaches it to express that knowledge helpfully.

Full fine-tuning on domain data, by contrast, updates the model’s knowledge. A common recipe combines continued pretraining on the new data with instruction tuning.

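PEFT for instruction tuning: LoRA and QLoRA train small low-rank adapter matrices while the base weights stay frozen (QLoRA additionally keeps the base weights quantized to 4 bits). A rank of $r = 8$–$64$ is typical for instruction tuning; domain adaptation often uses a higher rank.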
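The effect of the rank on training cost can be checked with simple arithmetic. A LoRA adapter for a weight matrix of shape $d_{out} \times d_{in}$ adds two factors of shapes $d_{out} \times r$ and $r \times d_{in}$, so it trains $r(d_{in} + d_{out})$ parameters instead of $d_{in} d_{out}$. The hidden size of 4096 below is a hypothetical value chosen for illustration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one adapted matrix: factors of shape (d_out, r) and (r, d_in)."""
    return rank * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out

d = 4096  # hypothetical hidden size of a square projection matrix
full = full_param_count(d, d)
for r in (8, 64):
    lora = lora_param_count(d, d, r)
    print(f"r={r}: {lora} trainable vs {full} full ({lora / full:.2%})")
```

Even at the top of the range, the adapter is a small fraction of the full matrix, which is why raising the rank for domain adaptation remains affordable.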