# Fine-Tuning LLMs: A Practical Guide for Custom AI Models
Fine-tuning allows you to adapt powerful pre-trained language models to your specific needs. Whether you’re building a customer service bot, a medical assistant, or a code generator, fine-tuning can dramatically improve performance.
## When to Fine-Tune

### Fine-Tuning is Right For:
- Domain-specific tasks: Legal, medical, financial language
- Style adaptation: Matching your brand voice
- Consistent formatting: Structured outputs like JSON
- Proprietary knowledge: Company-specific information
### Prompt Engineering is Better For:
- Simple tasks: Quick experiments, one-off queries
- Rapid prototyping: Testing ideas before investing
- General knowledge: Tasks within model’s training data
- Budget constraints: No training infrastructure needed
## Fine-Tuning Methods

### 1. Full Fine-Tuning
Update all model parameters. Most comprehensive but expensive.
Pros: Best performance

Cons: Requires significant compute, risk of catastrophic forgetting
### 2. Parameter-Efficient Fine-Tuning (PEFT)
Update only a small subset of parameters.
#### LoRA (Low-Rank Adaptation)

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
```
Pros: Up to ~1,000x fewer trainable parameters, faster training, smaller checkpoints

Cons: Slightly lower performance than full fine-tuning
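The parameter savings are easy to sanity-check by hand: for a weight matrix with input and output dimension d, LoRA adds r·(d_in + d_out) trainable parameters per adapted module. A rough sketch, assuming 7B-class model dimensions (hidden size 4096, 32 layers — these numbers are assumptions, not exact for every model) and the q_proj/v_proj targets from the config above:

```python
# LoRA adds factors A (r x d_in) and B (d_out x r) per target module.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096          # hidden size of a 7B-class model (assumption)
layers = 32       # number of transformer layers (assumption)
r = 16            # rank from the LoraConfig above
targets = 2       # q_proj and v_proj

trainable = layers * targets * lora_params(d, d, r)
total = 7_000_000_000  # approximate base model parameter count

print(f"trainable LoRA params: {trainable:,}")   # 8,388,608
print(f"fraction of base model: {trainable / total:.4%}")
```

With these assumptions, only about 0.1% of the model's weights are trained, which is where the orders-of-magnitude savings in optimizer state and checkpoint size come from.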
#### QLoRA
Quantized LoRA for even lower memory usage.
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
```
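The memory saving from 4-bit quantization is simple arithmetic: fp16 stores each weight in 2 bytes, nf4 in roughly half a byte. A back-of-the-envelope sketch for a 7B-parameter model, ignoring activations, optimizer state, and quantization overhead:

```python
params = 7_000_000_000

fp16_gb = params * 2 / 1024**3    # 2 bytes per weight
nf4_gb = params * 0.5 / 1024**3   # ~4 bits per weight

print(f"fp16 weights: {fp16_gb:.1f} GB")  # 13.0 GB
print(f"nf4 weights:  {nf4_gb:.1f} GB")   # 3.3 GB
```

This is why QLoRA fits a 7B model on consumer GPUs that fp16 LoRA cannot.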
## Step-by-Step Fine-Tuning Tutorial

### Step 1: Environment Setup

```bash
pip install transformers datasets accelerate peft bitsandbytes
```
### Step 2: Prepare Your Dataset

```python
from datasets import Dataset

# Example: customer support conversations
data = [
    {
        "instruction": "How do I reset my password?",
        "input": "",
        "output": "To reset your password, click 'Forgot Password' on the login page. Enter your email address and check your inbox for a reset link. The link expires in 24 hours."
    },
    # Add more examples...
]

dataset = Dataset.from_list(data)

# Format each record into a single training string
def format_prompt(example):
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": prompt}

dataset = dataset.map(format_prompt)
```
### Step 3: Load Base Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # or "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # the QLoRA config from above
    device_map="auto"
)
```
### Step 4: Configure LoRA

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
```
### Step 5: Set Up Training

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenize the formatted prompts so the Trainer receives input_ids
def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Causal-LM collator: pads batches and builds labels from input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
```
### Step 6: Train

```python
model.config.use_cache = False  # disable the KV cache during training
trainer.train()
```
### Step 7: Save and Export

```python
# Save the LoRA adapter (tens of MB, not the full model)
model.save_pretrained("./lora-adapter")

# Merge with the base model (optional). Reload in half precision first:
# merging directly into a 4-bit quantized model loses precision.
from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-adapter", torch_dtype=torch.float16, device_map="auto"
)
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```
## Dataset Best Practices

### Data Quantity
| Model Size | Minimum Examples | Recommended |
|---|---|---|
| 7B | 100-500 | 1,000-5,000 |
| 13B | 200-1,000 | 2,000-10,000 |
| 70B | 500-2,000 | 5,000-20,000 |
### Data Quality Guidelines
- Format consistency: Use the same template for all examples
- Output quality: Examples should represent your desired output
- Diversity: Cover edge cases and variations
- Length variety: Mix short and long responses
- Clean data: Remove duplicates and errors
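The last two points are easy to automate. A minimal cleaning pass (a sketch, assuming the instruction/input/output record format used in this guide) that drops empty outputs and exact duplicates:

```python
def clean_dataset(examples):
    seen = set()
    cleaned = []
    for ex in examples:
        if not ex["output"].strip():  # drop examples with empty outputs
            continue
        key = (ex["instruction"], ex["input"], ex["output"])
        if key in seen:               # drop exact duplicates
            continue
        seen.add(key)
        cleaned.append(ex)
    return cleaned

data = [
    {"instruction": "A", "input": "", "output": "answer"},
    {"instruction": "A", "input": "", "output": "answer"},  # duplicate
    {"instruction": "B", "input": "", "output": "   "},     # empty output
]
print(len(clean_dataset(data)))  # 1
```

Near-duplicate detection (fuzzy matching, embedding similarity) goes further, but exact deduplication alone catches most scraped-data problems.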
### Data Format Example

```json
{
  "instruction": "Summarize the following article",
  "input": "[Article text here...]",
  "output": "[High-quality summary...]"
}
```
## Hyperparameter Tuning

### Key Parameters
| Parameter | Description | Typical Range |
|---|---|---|
| learning_rate | Step size | 1e-5 to 5e-4 |
| batch_size | Samples per update | 4-32 |
| epochs | Training iterations | 1-10 |
| LoRA rank (r) | Adapter complexity | 8-64 |
| alpha | Scaling factor | 2x to 4x r |
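The alpha row matters because LoRA scales its weight update by alpha / r: the adapted layer computes W + (alpha / r) · B · A. Raising the rank without raising alpha therefore weakens the update. For the config used earlier:

```python
# alpha / r is the effective scale applied to the LoRA update
r, lora_alpha = 16, 32
print(lora_alpha / r)   # 2.0 (the common "alpha = 2x r" setting)

# Doubling r to 32 while keeping alpha fixed halves the scale:
print(lora_alpha / 32)  # 1.0
```

A common practice is to move r and alpha together so the ratio stays constant.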
### Learning Rate Recommendations
- 7B models: 2e-4
- 13B models: 1e-4
- 70B models: 5e-5
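These rates pair with the effective batch size used in training, which is a simple product of the batching arguments; as a rule of thumb, if you change it substantially, revisit the learning rate. With the TrainingArguments from the tutorial (single GPU assumed):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU setup

effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_gpus)
print(effective_batch)  # 16
```

Gradient accumulation lets you reach this effective batch size even when only 4 examples fit in GPU memory at once.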
## Training Infrastructure

### Hardware Requirements
| Model | Method | GPU Memory | Hardware |
|---|---|---|---|
| 7B | LoRA | 8-10 GB | RTX 3090 |
| 7B | QLoRA | 6-8 GB | RTX 3070 |
| 13B | LoRA | 16-20 GB | A100 40GB |
| 70B | QLoRA | 40-48 GB | A100 80GB |
### Cloud Options
- Google Colab: Free tier with T4 (limited)
- Lambda Cloud: $0.60/hour for A10
- RunPod: $0.44/hour for RTX 4090
- AWS SageMaker: Enterprise-grade, higher cost
## Evaluation

### Quantitative Metrics
```python
from evaluate import load

# Perplexity
perplexity = load("perplexity")
results = perplexity.compute(model_id=model_name, predictions=predictions)

# BLEU (for translation-style tasks)
bleu = load("bleu")
results = bleu.compute(predictions=predictions, references=references)
```
### Qualitative Evaluation
Test your model:
- Hold-out test set: 10-20% of data
- Edge cases: Unusual inputs
- Adversarial tests: Attempts to break the model
- Human evaluation: Expert review of outputs
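The hold-out split should be reproducible so evaluation numbers are comparable across runs. A pure-Python sketch (with the `datasets` library you could instead call `dataset.train_test_split(test_size=0.1, seed=42)`):

```python
import random

def split_holdout(examples, test_fraction=0.1, seed=42):
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]

data = [{"id": i} for i in range(100)]
train, test = split_holdout(data, test_fraction=0.1)
print(len(train), len(test))  # 90 10
```

Keep the test set frozen: once you tune hyperparameters against it, it stops measuring generalization.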
## Common Issues and Solutions

### Catastrophic Forgetting
Problem: Model loses general knowledge
Solutions:
- Include diverse training data
- Use lower learning rate
- Shorter training (fewer epochs)
- Mix with general instruction data
### Overfitting
Symptoms: Perfect training loss, poor generalization
Solutions:
- Add regularization (dropout, weight decay)
- More training data
- Early stopping
- Reduce model complexity (lower rank)
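Early stopping boils down to "stop once validation loss hasn't improved for N evaluations" (transformers also offers an `EarlyStoppingCallback` for the Trainer, paired with `load_best_model_at_end=True`). The core logic, as a small sketch:

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True once the best loss hasn't improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

losses = [2.1, 1.8, 1.6, 1.61, 1.62, 1.63]
print(should_stop(losses, patience=3))  # True: 3 evals without improvement
```

`min_delta` guards against stopping on noise: improvements smaller than it don't count.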
### Training Instability
Symptoms: Loss spikes, NaN values
Solutions:
- Lower learning rate
- Gradient clipping
- Check data quality
- Use mixed precision carefully
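Gradient clipping by global norm (what `TrainingArguments(max_grad_norm=...)` configures; the default is 1.0) rescales all gradients together when their combined L2 norm exceeds a threshold, which caps the size of any single update. In pure Python, over a flat list of gradient values:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients so their global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])  # global norm 5.0 -> scaled to 1.0
print(clipped)
```

Because all components shrink by the same factor, the gradient's direction is preserved; only its magnitude is capped.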
## Deployment

### Option 1: Hugging Face Hub

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./merged-model",
    repo_id="yourusername/your-model",
    repo_type="model"
)
```
### Option 2: Local Deployment

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="./merged-model",
    tokenizer=tokenizer
)
output = generator("### Instruction:\nSummarize this article\n\n### Response:", max_length=200)
```
### Option 3: vLLM for Production

```python
from vllm import LLM, SamplingParams

llm = LLM(model="./merged-model")
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(prompts, sampling_params)
```
## Cost Analysis

### Training Costs (Approximate)
| Model | Duration | Cloud Cost |
|---|---|---|
| 7B LoRA | 1-2 hours | $1-3 |
| 13B LoRA | 2-4 hours | $5-15 |
| 70B QLoRA | 6-12 hours | $50-150 |
### Inference Costs

A fine-tuned model costs the same to serve as its base model; the only extra cost is storage for the adapter (roughly 10-100 MB for LoRA).
## Next Steps
After mastering fine-tuning:
- RLHF: Reinforcement learning from human feedback
- DPO: Direct preference optimization
- Multi-task training: Single model for multiple tasks
- Continual learning: Update models with new data
Learn more AI development techniques in our guides section and explore AI tools.