When Fine-Tuning Actually Makes Sense
Prompt engineering and RAG solve most use cases. Fine-tuning is worth the investment when you need a model to consistently produce a very specific output format or style that is hard to enforce through prompts alone, when inference latency is critical and you want a smaller model that matches a larger one's quality on your domain, or when your use case requires hundreds of thousands of inferences per day and the cost savings from a smaller fine-tuned model are significant.
LoRA (Low-Rank Adaptation) makes fine-tuning accessible. Instead of updating all billions of model parameters, LoRA inserts small trainable matrices alongside the frozen base model weights. A 7B parameter model that would require 8x A100s to full fine-tune can be LoRA fine-tuned on a single consumer GPU in hours.
Dataset Preparation
The most important factor in fine-tuning quality is dataset quality, not training time or hyperparameters. Aim for 500–5000 high-quality examples in instruction-response format. More low-quality data is worse than fewer high-quality examples.
# dataset format: JSONL with instruction/response pairs
{"instruction": "Summarize this support ticket and classify its urgency.",
"input": "Subject: App keeps crashing...",
"output": "{\"summary\": \"User reports consistent crash on checkout page\", \"urgency\": \"high\", \"category\": \"bug\"}"}
# Convert to HuggingFace datasets format
from datasets import Dataset
import json
data = [json.loads(l) for l in open('train.jsonl')]
dataset = Dataset.from_list(data)
dataset.push_to_hub('your-org/your-dataset')
For domain-specific tasks, generate synthetic training data using GPT-4o or Claude — prompt the model with examples of your task, generate 2000+ completions, then manually review and curate the best 500. This is often faster than labelling from scratch.
Training with Unsloth
Unsloth is the fastest way to fine-tune open-source models. It provides 2x faster training and 70% less memory usage compared to standard HuggingFace PEFT training, making it practical to fine-tune 7B models on a single GPU.
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank — higher = more capacity but slower
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
output_dir="./outputs",
save_strategy="epoch",
),
)
trainer.train()
Evaluating Your Fine-Tuned Model
Never rely on training loss alone to judge a fine-tuned model. Build a held-out evaluation set of 100–200 examples not seen during training and measure task-specific metrics: accuracy for classification, ROUGE scores for summarisation, or a human preference score for generation tasks.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("./outputs/checkpoint-final")
FastLanguageModel.for_inference(model)
correct = 0
for example in eval_dataset:
inputs = tokenizer(example["instruction"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
if evaluate(prediction, example["output"]):
correct += 1
print(f"Accuracy: {correct / len(eval_dataset):.2%}")
Saving and Serving the Model
Export your fine-tuned model to GGUF format for efficient CPU inference with llama.cpp, or merge the LoRA weights into the base model and push to HuggingFace for vLLM serving.
# Merge LoRA into base model and save
model.save_pretrained_merged("merged-model", tokenizer, save_method="merged_16bit")
# Export to GGUF for llama.cpp
model.save_pretrained_gguf("model-gguf", tokenizer, quantization_method="q4_k_m")
# Push merged model to HuggingFace
model.push_to_hub_merged("your-org/your-fine-tuned-model", tokenizer, save_method="merged_16bit")
For production serving, push the merged model to HuggingFace and deploy it with vLLM. The fine-tuned model serves via the same OpenAI-compatible API as any other vLLM-hosted model, making it a drop-in replacement for the base model.
Common Fine-Tuning Mistakes
- Too little data: Under 200 examples rarely produces meaningful specialisation. 1000+ is a safer minimum.
- Training too long: More epochs on a small dataset causes overfitting. Watch validation loss and stop when it plateaus.
- Forgetting catastrophic forgetting: Fine-tuning for a narrow task can degrade general capabilities. Test the model on general prompts after fine-tuning, not just your target task.
- Skipping the base model evaluation: Always evaluate the base model on your eval set before fine-tuning. If the base model already performs at 85% accuracy, fine-tuning for 88% may not justify the complexity.