AI/LLM Cheatsheet 13 — Fine-tuning

Quick reference for fine-tuning LLMs: when to use it, data format, hosted and LoRA-based training, evaluation, and cost.

When to fine-tune

Consistent, specific output format.
Shorter prompts (saves cost).
Stylistic / tone consistency.
Domain-specific terminology.
Small model matching a larger one on a narrow task.

When NOT to:

Better prompting already solves the problem.
Too few quality examples to train on.
Knowledge injection (use RAG instead).

Data format (OpenAI)

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [...]}

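One JSON object per line (JSONL). OpenAI requires at least 10 examples and reports clear gains from 50-100; 500-1000 is a common target for harder tasks.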
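A minimal sketch of checking examples against the chat format above before uploading, using only the standard library. The function names (`validate_example`, `write_jsonl`), the set of allowed roles, and the specific checks are illustrative assumptions, not OpenAI's own validation rules.

```python
import json

# Roles accepted by this sketch; an assumption, not an exhaustive list.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(example):
    """Return a list of problems in one training example (empty list if it looks fine)."""
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should be the assistant reply to learn")
    return problems

def write_jsonl(examples, path):
    """Write only the valid examples, one JSON object per line; return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            if not validate_example(example):
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
                written += 1
    return written

examples = [
    {"messages": [
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user", "content": "Name a color."},
        {"role": "assistant", "content": "{\"color\": \"blue\"}"},
    ]},
    {"messages": [{"role": "user", "content": "No assistant reply here."}]},
]
for example in examples:
    print(validate_example(example))
```

Calling `write_jsonl(examples, "train.jsonl")` would then produce a file containing only the first example.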

OpenAI fine-tune

# Upload data (prints a file ID such as file-abc)
openai api files.create -f train.jsonl -p fine-tune

# Start job, passing the uploaded file ID
openai api fine_tuning.jobs.create -m gpt-4o-mini-2024-07-18 -F file-abc

# Monitor
openai api fine_tuning.jobs.list

# Use the fine-tuned model name reported by the finished job
openai api chat.completions.create -m ft:gpt-4o-mini-2024-07-18:org:custom:abc123 -g user "Hello"
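The same upload, start, and monitor steps can be sketched with the OpenAI Python SDK. The helper name `run_fine_tune` and its polling interval are assumptions for illustration; the client is passed in so the function can be exercised without network access.

```python
import time

# Job states after which polling can stop.
TERMINAL_STATES = ("succeeded", "failed", "cancelled")

def run_fine_tune(client, train_path, base_model="gpt-4o-mini-2024-07-18", poll_seconds=30):
    """Upload a JSONL file, start a fine-tuning job, and poll until it finishes."""
    with open(train_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=uploaded.id, model=base_model)
    while job.status not in TERMINAL_STATES:
        time.sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job.id)
    return job

# Usage (needs the openai package and OPENAI_API_KEY in the environment):
# from openai import OpenAI
# job = run_fine_tune(OpenAI(), "train.jsonl")
# print(job.status, job.fine_tuned_model)
```

On success, `job.fine_tuned_model` holds the `ft:...` name to pass as the model in chat completion calls.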

LoRA (Low-Rank Adaptation)

Adds small trainable low-rank matrices alongside the frozen base weights. Typically ~1% of params or fewer are trained.

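from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # scaling; the update is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# base_model: an already loaded causal LM, e.g. from AutoModelForCausalLM.from_pretrained
model = get_peft_model(base_model, config)
model.print_trainable_parameters()         # reports trainable vs total parameter counts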
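As a rough check on the parameter share, LoRA adds two matrices per adapted weight: A of shape r x d_in and B of shape d_out x r, so r * (d_in + d_out) trainable parameters. The sketch below works this out for a hypothetical model shape (hidden size 4096, 32 layers, about 8e9 parameters, square q_proj and v_proj); those numbers are assumptions, and real models with grouped-query attention have smaller v_proj matrices.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

def lora_total(hidden, n_layers, r, targets_per_layer):
    """Total added parameters when every target is a square hidden x hidden projection."""
    return n_layers * targets_per_layer * lora_param_count(hidden, hidden, r)

# Hypothetical shape, matching r=16 and two target modules per layer.
added = lora_total(hidden=4096, n_layers=32, r=16, targets_per_layer=2)
fraction = added / 8e9
print(f"{added:,} trainable parameters ({fraction:.3%} of the base model)")
```

With only q_proj and v_proj targeted the share comes out well under 1%; adding more target modules or a higher rank raises it.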

QLoRA

LoRA on a 4-bit quantized base model. Can fit a ~70B model on a single 80 GB A100/H100 with short sequences and small batches.

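import torch
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.float16,   # torch.bfloat16 is preferable on GPUs that support it
)
# Example Hub ID; substitute the base model you are tuning
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb)
model = prepare_model_for_kbit_training(model)
# lora_config: a LoraConfig like the one in the LoRA section
model = get_peft_model(model, lora_config)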
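The memory claim can be sanity-checked with back-of-envelope arithmetic. This sketch counts weight storage only; activations, LoRA parameters, optimizer state, and quantization constants add to it, so treat the result as a lower bound. The function name is an assumption for illustration.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# A 70B-parameter model at different precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

At 4 bits the weights take about 35 GB, which leaves headroom on an 80 GB card; at 16 bits they would need about 140 GB.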

Unsloth (faster)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, target_modules=[...])

# Train with HF TRL SFTTrainer

Roughly 2x faster with lower memory use, according to the Unsloth project.

Training loop

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,    # effective batch size of 16 per device
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,                        # use bf16=True instead on GPUs that support it
        logging_steps=10,
        output_dir="./output",
        max_seq_length=2048,              # named max_length in newer TRL releases
    ),
)
trainer.train()
trainer.save_model("./adapter")           # with a PEFT model, saves only the adapter weights
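To see what these settings imply, the effective batch size and approximate number of optimizer steps can be worked out by hand. The dataset size of 1,000 examples and the single device are assumptions; the exact step count reported by the trainer can differ slightly depending on how partial batches are handled.

```python
import math

def training_steps(n_examples, per_device_batch, grad_accum, n_devices, epochs):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = training_steps(
    n_examples=1000, per_device_batch=4, grad_accum=4, n_devices=1, epochs=3
)
print(effective_batch, total_steps)
```

With roughly 189 steps in total, the 10 warmup steps cover about 5% of training.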

DPO (preference tuning)

from trl import DPOTrainer

trainer = DPOTrainer(model, args=..., train_dataset=preference_dataset)

Data: (prompt, chosen, rejected) triples. Trains the model to prefer the chosen response over the rejected one relative to a frozen reference model, with no separate reward model.
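A minimal sketch of one preference record and of the DPO loss for a single triple, assuming the summed log-probabilities of each response under the policy and the reference model are already computed. The log-prob values and the function name are made up for illustration; `beta=0.1` is a commonly used setting, not a recommendation from this cheatsheet.

```python
import math

# One preference record using prompt / chosen / rejected fields.
record = {
    "prompt": "Summarize: the meeting moved to Friday.",
    "chosen": "The meeting is now on Friday.",
    "rejected": "Meetings are important.",
}

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin compares policy gains over the reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: no preference learned yet, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy raised the chosen response and lowered the rejected one: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy increases the chosen response's log-probability relative to the reference faster than it does for the rejected one.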

Eval after fine-tune

Always compare base vs fine-tuned on a held-out test set. Watch for:

Catastrophic forgetting (lost general ability).
Overfitting (memorized the training set).
Bias amplification.
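The base-versus-fine-tuned comparison can be sketched as follows. Exact match is only one possible metric, the `prompt` / `expected` field names are hypothetical, and the two lambda "models" are stand-ins for real generation calls.

```python
import random

def split_dataset(examples, test_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a test set that is never trained on."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def exact_match_rate(generate, test_set):
    """Fraction of test prompts where the model output equals the expected answer."""
    hits = sum(generate(ex["prompt"]).strip() == ex["expected"].strip() for ex in test_set)
    return hits / len(test_set)

data = [{"prompt": f"{i}+1=", "expected": str(i + 1)} for i in range(50)]
train_set, test_set = split_dataset(data)

# Stand-ins for the two models being compared.
base_model = lambda prompt: "unknown"
tuned_model = lambda prompt: str(int(prompt.split("+")[0]) + 1)

print("base: ", exact_match_rate(base_model, test_set))
print("tuned:", exact_match_rate(tuned_model, test_set))
```

Running the same function on a general-purpose benchmark before and after tuning is one way to spot catastrophic forgetting.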

Merge LoRA back

merged = model.merge_and_unload()
merged.save_pretrained("./merged")

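Folds the adapter into the base weights so the model can be deployed without the PEFT runtime. For QLoRA, reload the base model in 16-bit before merging.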
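Why merging changes nothing at inference time: the adapted layer computes W x + (alpha / r) * B (A x), which equals (W + (alpha / r) * B A) x. The toy example below checks this with plain Python lists; the tiny matrices and helper names are assumptions chosen so the arithmetic is easy to follow.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A, the merged weight matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1, 0, 2], [0, 1, 1]]   # frozen base weight, 2 x 3
A = [[1, 2, 3]]              # LoRA A, r x d_in with r = 1
B = [[1], [-1]]              # LoRA B, d_out x r
x = [[1], [1], [1]]          # input column vector
alpha, r = 2, 1

# Forward pass with the adapter kept separate.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
separate = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Forward pass through the merged weight.
merged = matmul(merge_lora(W, A, B, alpha, r), x)
print(separate, merged)
```

Both paths give the same output, which is why the merged model needs no adapter code at serving time.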

Cost

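OpenAI fine-tune: ~$25 per 1M training tokens for GPT-4o-class models (mini models cost less); fine-tuned inference is priced above the base model. Prices change, so check the current pricing page.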
LoRA on RunPod / Modal: $5-50 for typical run.
QLoRA on a local Apple M-series machine (e.g. via MLX): no cloud cost, but slow.
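A rough way to estimate a hosted training bill, assuming billing by training tokens multiplied by epochs. The default rate is the ~$25 per 1M tokens figure quoted above; the dataset size and average example length are made-up inputs.

```python
def training_cost_usd(n_examples, avg_tokens_per_example, epochs, usd_per_million_tokens=25.0):
    """Estimated cost: tokens seen during training times the per-token rate."""
    billed_tokens = n_examples * avg_tokens_per_example * epochs
    return billed_tokens / 1_000_000 * usd_per_million_tokens

# 1,000 examples of ~500 tokens each, trained for 3 epochs.
print(f"${training_cost_usd(1000, 500, 3):.2f}")
```

Swap in the current rate for the chosen base model before relying on the number.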

Common mistakes

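Too little data → overfit.
No validation set, so overfitting goes unnoticed.
Skipping the baseline eval on the base model.
Long sequences (e.g. >2048 tokens) → OOM; truncate or use gradient checkpointing.
Chat template at inference differs from the one used in training.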
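Two of these mistakes can be guarded against in data preparation. The sketch below uses one rendering function for both training text and inference prompts, and drops examples that exceed a token budget. The template markers are hypothetical (real models define their own chat template), and `count_tokens` is a whitespace stand-in for a real tokenizer.

```python
def render(messages, add_generation_prompt=False):
    """Hypothetical chat template shared by training and inference code paths."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)
    return text + ("<|assistant|>\n" if add_generation_prompt else "")

def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated pieces."""
    return len(text.split())

def drop_long(examples, max_tokens=2048):
    """Keep only examples whose rendered text fits the sequence length budget."""
    return [ex for ex in examples if count_tokens(render(ex["messages"])) <= max_tokens]

messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
train_text = render(messages)                                   # full conversation for training
prompt = render(messages[:-1], add_generation_prompt=True)      # what the model sees at inference
print(train_text.startswith(prompt))
```

Because both strings come from the same function, the inference prompt is always a prefix of what the model saw during training.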
