Fine-Tuning LLMs with LoRA: Building Custom Models on a Budget

When Fine-Tuning Actually Makes Sense

Prompt engineering and RAG solve most use cases. Fine-tuning is worth the investment when you need a model to consistently produce a very specific output format or style that is hard to enforce through prompts alone, when inference latency is critical and you want a smaller model that matches a larger one's quality on your domain, or when your use case requires hundreds of thousands of inferences per day and the cost savings from a smaller fine-tuned model are significant.

LoRA (Low-Rank Adaptation) makes fine-tuning accessible. Instead of updating all billions of model parameters, LoRA inserts small trainable matrices alongside the frozen base model weights. A 7B parameter model that would require 8x A100s to full fine-tune can be LoRA fine-tuned on a single consumer GPU in hours.

Dataset Preparation

The most important factor in fine-tuning quality is dataset quality, not training time or hyperparameters. Aim for 500–5000 high-quality examples in instruction-response format. More low-quality data is worse than fewer high-quality examples.

# dataset format: JSONL with instruction/response pairs
{"instruction": "Summarize this support ticket and classify its urgency.",
 "input": "Subject: App keeps crashing...",
 "output": "{\"summary\": \"User reports consistent crash on checkout page\", \"urgency\": \"high\", \"category\": \"bug\"}"}

# Convert to HuggingFace datasets format
from datasets import Dataset
import json

data = [json.loads(l) for l in open('train.jsonl')]
dataset = Dataset.from_list(data)
dataset.push_to_hub('your-org/your-dataset')

For domain-specific tasks, generate synthetic training data using GPT-4o or Claude — prompt the model with examples of your task, generate 2000+ completions, then manually review and curate the best 500. This is often faster than labelling from scratch.

Training with Unsloth

Unsloth is the fastest way to fine-tune open-source models. It provides 2x faster training and 70% less memory usage compared to standard HuggingFace PEFT training, making it practical to fine-tune 7B models on a single GPU.

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank — higher = more capacity but slower
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="./outputs",
        save_strategy="epoch",
    ),
)
trainer.train()

Evaluating Your Fine-Tuned Model

Never rely on training loss alone to judge a fine-tuned model. Build a held-out evaluation set of 100–200 examples not seen during training and measure task-specific metrics: accuracy for classification, ROUGE scores for summarisation, or a human preference score for generation tasks.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("./outputs/checkpoint-final")
FastLanguageModel.for_inference(model)

correct = 0
for example in eval_dataset:
    inputs = tokenizer(example["instruction"], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256)
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if evaluate(prediction, example["output"]):
        correct += 1

print(f"Accuracy: {correct / len(eval_dataset):.2%}")

Saving and Serving the Model

Export your fine-tuned model to GGUF format for efficient CPU inference with llama.cpp, or merge the LoRA weights into the base model and push to HuggingFace for vLLM serving.

# Merge LoRA into base model and save
model.save_pretrained_merged("merged-model", tokenizer, save_method="merged_16bit")

# Export to GGUF for llama.cpp
model.save_pretrained_gguf("model-gguf", tokenizer, quantization_method="q4_k_m")

# Push merged model to HuggingFace
model.push_to_hub_merged("your-org/your-fine-tuned-model", tokenizer, save_method="merged_16bit")

For production serving, push the merged model to HuggingFace and deploy it with vLLM. The fine-tuned model serves via the same OpenAI-compatible API as any other vLLM-hosted model, making it a drop-in replacement for the base model.

Common Fine-Tuning Mistakes

Too little data: Under 200 examples rarely produces meaningful specialisation. 1000+ is a safer minimum.
Training too long: More epochs on a small dataset causes overfitting. Watch validation loss and stop when it plateaus.
Forgetting catastrophic forgetting: Fine-tuning for a narrow task can degrade general capabilities. Test the model on general prompts after fine-tuning, not just your target task.
Skipping the base model evaluation: Always evaluate the base model on your eval set before fine-tuning. If the base model already performs at 85% accuracy, fine-tuning for 88% may not justify the complexity.

Fine-Tuning LLMs with LoRA: Building Custom Models on a Budget

When Fine-Tuning Actually Makes Sense

Dataset Preparation

Training with Unsloth

Evaluating Your Fine-Tuned Model

Saving and Serving the Model

Common Fine-Tuning Mistakes

Jaspi.io — AI Hiring Platform

How to Build a Production RAG System with LangChain and OpenAI

Building Multi-Agent AI Systems with LangGraph

Want to Build This for Your Team?