# Fine-Tuning LLMs: A Practical Guide for Custom AI Models
Fine-tuning allows you to adapt powerful pre-trained language models to your specific needs. Whether you’re building a customer service bot, a medical assistant, or a code generator, fine-tuning can dramatically improve performance.
## When to Fine-Tune

### Fine-Tuning is Right For:
- Domain-specific tasks: Legal, medical, financial language
- Style adaptation: Matching your brand voice
- Consistent formatting: Structured outputs like JSON
- Proprietary knowledge: Company-specific information
### Prompt Engineering is Better For:
- Simple tasks: Quick experiments, one-off queries
- Rapid prototyping: Testing ideas before investing
- General knowledge: Tasks within model’s training data
- Budget constraints: No training infrastructure needed
## Fine-Tuning Methods

### 1. Full Fine-Tuning
Update all model parameters. Most comprehensive but expensive.
Pros: Best performance

Cons: Requires significant compute, risk of catastrophic forgetting
### 2. Parameter-Efficient Fine-Tuning (PEFT)
Update only a small subset of parameters.
#### LoRA (Low-Rank Adaptation)

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
```
Pros: Up to ~1,000x fewer trainable parameters, faster training, smaller checkpoints

Cons: Slightly lower performance than full fine-tuning
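The parameter savings are easy to sanity-check by hand: for a weight matrix with input and output dimension d, LoRA adds r·(d_in + d_out) trainable parameters per adapted module. A rough sketch, assuming 7B-class model dimensions (hidden size 4096, 32 layers — these numbers are assumptions, not exact for every model) and the q_proj/v_proj targets from the config above:

```python
# LoRA adds factors A (r x d_in) and B (d_out x r) per target module.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096          # hidden size of a 7B-class model (assumption)
layers = 32       # number of transformer layers (assumption)
r = 16            # rank from the LoraConfig above
targets = 2       # q_proj and v_proj

trainable = layers * targets * lora_params(d, d, r)
total = 7_000_000_000  # approximate base model parameter count

print(f"trainable LoRA params: {trainable:,}")   # 8,388,608
print(f"fraction of base model: {trainable / total:.4%}")
```

With these assumptions, only about 0.1% of the model's weights are trained, which is where the orders-of-magnitude savings in optimizer state and checkpoint size come from.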
#### QLoRA
Quantized LoRA for even lower memory usage.
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
```
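The memory saving from 4-bit quantization is simple arithmetic: fp16 stores each weight in 2 bytes, nf4 in roughly half a byte. A back-of-the-envelope sketch for a 7B-parameter model, ignoring activations, optimizer state, and quantization overhead:

```python
params = 7_000_000_000

fp16_gb = params * 2 / 1024**3    # 2 bytes per weight
nf4_gb = params * 0.5 / 1024**3   # ~4 bits per weight

print(f"fp16 weights: {fp16_gb:.1f} GB")  # 13.0 GB
print(f"nf4 weights:  {nf4_gb:.1f} GB")   # 3.3 GB
```

This is why QLoRA fits a 7B model on consumer GPUs that fp16 LoRA cannot.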
## Step-by-Step Fine-Tuning Tutorial

### Step 1: Environment Setup

```bash
pip install transformers datasets accelerate peft bitsandbytes
```
### Step 2: Prepare Your Dataset

```python
from datasets import Dataset

# Example: customer support conversations
data = [
    {
        "instruction": "How do I reset my password?",
        "input": "",
        "output": "To reset your password, click 'Forgot Password' on the login page. Enter your email address and check your inbox for a reset link. The link expires in 24 hours."
    },
    # Add more examples...
]

dataset = Dataset.from_list(data)

# Format each record into a single training string
def format_prompt(example):
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": prompt}

dataset = dataset.map(format_prompt)
```
### Step 3: Load Base Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # or "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # the QLoRA config from above
    device_map="auto"
)
```
### Step 4: Configure LoRA

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
```
### Step 5: Set Up Training

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenize the formatted prompts so the Trainer receives input_ids
def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Causal-LM collator: pads batches and builds labels from input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
```
### Step 6: Train

```python
model.config.use_cache = False  # disable the KV cache during training
trainer.train()
```
### Step 7: Save and Export

```python
# Save the LoRA adapter (tens of MB, not the full model)
model.save_pretrained("./lora-adapter")

# Merge with the base model (optional). Reload in half precision first:
# merging directly into a 4-bit quantized model loses precision.
from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-adapter", torch_dtype=torch.float16, device_map="auto"
)
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```
## Dataset Best Practices

### Data Quantity
| Model Size | Minimum Examples | Recommended |
|---|---|---|
| 7B | 100-500 | 1,000-5,000 |
| 13B | 200-1,000 | 2,000-10,000 |
| 70B | 500-2,000 | 5,000-20,000 |
### Data Quality Guidelines
- Format consistency: Use the same template for all examples
- Output quality: Examples should represent your desired output
- Diversity: Cover edge cases and variations
- Length variety: Mix short and long responses
- Clean data: Remove duplicates and errors
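The last two points are easy to automate. A minimal cleaning pass (a sketch, assuming the instruction/input/output record format used in this guide) that drops empty outputs and exact duplicates:

```python
def clean_dataset(examples):
    seen = set()
    cleaned = []
    for ex in examples:
        if not ex["output"].strip():  # drop examples with empty outputs
            continue
        key = (ex["instruction"], ex["input"], ex["output"])
        if key in seen:               # drop exact duplicates
            continue
        seen.add(key)
        cleaned.append(ex)
    return cleaned

data = [
    {"instruction": "A", "input": "", "output": "answer"},
    {"instruction": "A", "input": "", "output": "answer"},  # duplicate
    {"instruction": "B", "input": "", "output": "   "},     # empty output
]
print(len(clean_dataset(data)))  # 1
```

Near-duplicate detection (fuzzy matching, embedding similarity) goes further, but exact deduplication alone catches most scraped-data problems.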
### Data Format Example

```json
{
  "instruction": "Summarize the following article",
  "input": "[Article text here...]",
  "output": "[High-quality summary...]"
}
```
## Hyperparameter Tuning

### Key Parameters
| Parameter | Description | Typical Range |
|---|---|---|
| learning_rate | Step size | 1e-5 to 5e-4 |
| batch_size | Samples per update | 4-32 |
| epochs | Training iterations | 1-10 |
| LoRA rank (r) | Adapter complexity | 8-64 |
| alpha | Scaling factor | 2x to 4x r |
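The alpha row matters because LoRA scales its weight update by alpha / r: the adapted layer computes W + (alpha / r) · B · A. Raising the rank without raising alpha therefore weakens the update. For the config used earlier:

```python
# alpha / r is the effective scale applied to the LoRA update
r, lora_alpha = 16, 32
print(lora_alpha / r)   # 2.0 (the common "alpha = 2x r" setting)

# Doubling r to 32 while keeping alpha fixed halves the scale:
print(lora_alpha / 32)  # 1.0
```

A common practice is to move r and alpha together so the ratio stays constant.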
### Learning Rate Recommendations
- 7B models: 2e-4
- 13B models: 1e-4
- 70B models: 5e-5
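These rates pair with the effective batch size used in training, which is a simple product of the batching arguments; as a rule of thumb, if you change it substantially, revisit the learning rate. With the TrainingArguments from the tutorial (single GPU assumed):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU setup

effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_gpus)
print(effective_batch)  # 16
```

Gradient accumulation lets you reach this effective batch size even when only 4 examples fit in GPU memory at once.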
## Training Infrastructure

### Hardware Requirements
| Model | Method | GPU Memory | Hardware |
|---|---|---|---|
| 7B | LoRA | 8-10 GB | RTX 3090 |
| 7B | QLoRA | 6-8 GB | RTX 3070 |
| 13B | LoRA | 16-20 GB | A100 40GB |
| 70B | QLoRA | 40-48 GB | A100 80GB |
### Cloud Options
- Google Colab: Free tier with T4 (limited)
- Lambda Cloud: $0.60/hour for A10
- RunPod: $0.44/hour for RTX 4090
- AWS SageMaker: Enterprise-grade, higher cost
## Evaluation

### Quantitative Metrics
```python
from evaluate import load

# Perplexity
perplexity = load("perplexity")
results = perplexity.compute(model_id=model_name, predictions=predictions)

# BLEU (for translation-style tasks)
bleu = load("bleu")
results = bleu.compute(predictions=predictions, references=references)
```
### Qualitative Evaluation
Test your model:
- Hold-out test set: 10-20% of data
- Edge cases: Unusual inputs
- Adversarial tests: Attempts to break the model
- Human evaluation: Expert review of outputs
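The hold-out split should be reproducible so evaluation numbers are comparable across runs. A pure-Python sketch (with the `datasets` library you could instead call `dataset.train_test_split(test_size=0.1, seed=42)`):

```python
import random

def split_holdout(examples, test_fraction=0.1, seed=42):
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]

data = [{"id": i} for i in range(100)]
train, test = split_holdout(data, test_fraction=0.1)
print(len(train), len(test))  # 90 10
```

Keep the test set frozen: once you tune hyperparameters against it, it stops measuring generalization.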
## Common Issues and Solutions

### Catastrophic Forgetting
Problem: Model loses general knowledge
Solutions:
- Include diverse training data
- Use lower learning rate
- Shorter training (fewer epochs)
- Mix with general instruction data
### Overfitting
Symptoms: Perfect training loss, poor generalization
Solutions:
- Add regularization (dropout, weight decay)
- More training data
- Early stopping
- Reduce model complexity (lower rank)
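Early stopping boils down to "stop once validation loss hasn't improved for N evaluations" (transformers also offers an `EarlyStoppingCallback` for the Trainer, paired with `load_best_model_at_end=True`). The core logic, as a small sketch:

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True once the best loss hasn't improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

losses = [2.1, 1.8, 1.6, 1.61, 1.62, 1.63]
print(should_stop(losses, patience=3))  # True: 3 evals without improvement
```

`min_delta` guards against stopping on noise: improvements smaller than it don't count.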
### Training Instability
Symptoms: Loss spikes, NaN values
Solutions:
- Lower learning rate
- Gradient clipping
- Check data quality
- Use mixed precision carefully
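Gradient clipping by global norm (what `TrainingArguments(max_grad_norm=...)` configures; the default is 1.0) rescales all gradients together when their combined L2 norm exceeds a threshold, which caps the size of any single update. In pure Python, over a flat list of gradient values:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients so their global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])  # global norm 5.0 -> scaled to 1.0
print(clipped)
```

Because all components shrink by the same factor, the gradient's direction is preserved; only its magnitude is capped.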
## Deployment

### Option 1: Hugging Face Hub

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./merged-model",
    repo_id="yourusername/your-model",
    repo_type="model"
)
```
### Option 2: Local Deployment

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="./merged-model",
    tokenizer=tokenizer
)
output = generator("### Instruction:\nSummarize this article\n\n### Response:", max_length=200)
```
### Option 3: vLLM for Production

```python
from vllm import LLM, SamplingParams

llm = LLM(model="./merged-model")
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(prompts, sampling_params)
```
## Cost Analysis

### Training Costs (Approximate)
| Model | Duration | Cloud Cost |
|---|---|---|
| 7B LoRA | 1-2 hours | $1-3 |
| 13B LoRA | 2-4 hours | $5-15 |
| 70B QLoRA | 6-12 hours | $50-150 |
### Inference Costs

A fine-tuned model costs the same to serve as its base model; the only extra cost is storage for the adapter (roughly 10-100 MB for LoRA).
## Next Steps
After mastering fine-tuning:
- RLHF: Reinforcement learning from human feedback
- DPO: Direct preference optimization
- Multi-task training: Single model for multiple tasks
- Continual learning: Update models with new data
Learn more AI development techniques in our guides section and explore AI tools.