Skip to main content
Lesson 5

RAG, Fine-Tuning & Embeddings

You've learned how LLMs work and how to prompt them. But what if you need the AI to know about YOUR specific data — your company's docs, your product manuals, your research? That's where fine-tuning and RAG come in.

After this lesson, you'll be able to:

  • Explain the difference between fine-tuning and RAG
  • Identify when to use fine-tuning vs. RAG for a given scenario
  • Describe how embeddings represent meaning as numbers

Fine-Tuning: Teaching an Expert New Tricks

Fine-tuning means taking a pre-trained model and training it further on a specific, smaller dataset. The model has already learned the fundamentals of language from its massive pre-training — fine-tuning adapts that general knowledge to a specialized domain or task.

Analogy: A pianist trained in classical music (pre-trained) learning to play jazz (fine-tuned). They don't start from zero — they adapt existing skills like rhythm, harmony, and finger technique to a new style. The base musicianship transfers; only the specialization is new.

When to use fine-tuning: When you need the model to consistently behave a certain way or understand domain-specific patterns. For example, training a model on thousands of legal contracts so it writes in the right legal style, or fine-tuning a model on your company's support conversations so it matches your brand voice.

RAG: Retrieval-Augmented Generation

Instead of relying only on what the model learned during training, RAG lets the model retrieve relevant information from an external knowledge base before generating its response. It's like giving the AI access to a reference library.

How RAG Works

RAG pipeline diagram A user question is converted into an embedding, used to search a knowledge base, top documents are retrieved, then both question and documents are sent to the LLM to produce a grounded answer. Question Search (by meaning) Top Docs (retrieved) LLM (Q + docs) Answer embed find attach generate
1

User asks a question — "What is our return policy for electronics?"

2

System searches knowledge base — finds the relevant return policy documents

3

Relevant docs retrieved — the matching policy sections are pulled out

4

LLM generates answer — uses those docs + the question to produce an accurate, grounded response

Analogy: Instead of answering from memory (which might be wrong), you look up the answer in a textbook first, then explain it in your own words. You get the accuracy of the source material combined with the fluency of natural language.

Why RAG matters: It reduces hallucinations (the model making things up), keeps answers current without retraining, and lets you ground responses in your actual data. There's no need to retrain the model when your documents change — just update the knowledge base.

Fine-Tuning vs. RAG

These are two different tools for different problems. Here's how they compare:

Fine-Tuning

  • Permanently changes model weights
  • Bakes knowledge into the model
  • Good for behavior, style, and tone
  • Requires retraining when data changes
  • Higher upfront cost

RAG

  • Keeps model weights as-is
  • Provides info at query time
  • Good for factual, changing info
  • Just update the knowledge base
  • Lower ongoing cost

In practice, many production systems use both together — fine-tuning to set the model's behavior and tone, and RAG to provide accurate, up-to-date factual information.

Embeddings: Meaning as Numbers

Embeddings are a way to represent words, sentences, or documents as numbers (vectors) so that similar meanings end up close together in mathematical space. This is how computers understand that words are related.

For example, "king" and "queen" would be close together because they share similar meaning. "King" and "bicycle" would be far apart. "Happy" and "joyful" would be neighbors, while "happy" and "concrete" would be distant.

Embedding Space (2D illustration)

Embedding space scatter A two-dimensional illustration showing semantically similar words clustering: "cat," "dog," and "puppy" together; "king," "queen," and "prince" together; "table" and "refrigerator" in their own region. Distance reflects difference in meaning. cat dog puppy table refrigerator king queen prince ANIMALS FURNITURE ROYALTY
Analogy: Imagine plotting words on a map where similar words are in the same neighborhood. "Dog," "puppy," and "canine" would all be on the same block. "Refrigerator" would be across town. Embeddings create this map mathematically.

A technical note: real embeddings live in hundreds to a few thousand dimensions depending on the model (OpenAI's text-embedding-3-large uses 3,072), not on a 2D map. The neighborhood analogy captures the principle (similar meanings cluster together), not the shape.

Why embeddings matter: Embeddings power the search step in RAG. When a user asks a question, the system converts the question into an embedding, then finds documents with similar embeddings. This is how it retrieves relevant information — not by matching keywords, but by matching meaning.

Pre-training an LLM is extraordinarily expensive. Training a frontier model from scratch costs millions of dollars in compute, takes months of processing time, and requires massive datasets. Only a handful of companies in the world can do it.

Fine-tuning is much cheaper. It typically takes hours or days instead of months, and costs thousands of dollars instead of millions. You're adjusting an existing model, not building one from the ground up.

RAG requires no retraining at all. You simply update the documents in your knowledge base. This makes it the most cost-effective and flexible option for keeping AI responses current and accurate. That's why RAG has become one of the most popular patterns in production AI systems.

Key Takeaway

Fine-tuning permanently adapts a model for a specific domain. RAG retrieves external information at query time. Fine-tuning is for behavior, RAG is for facts. Embeddings make this all work by converting meaning into searchable numbers.

Try This (Optional)

Your team wants to build an internal chatbot that answers HR policy questions. The policies are updated quarterly, and employees often phrase questions informally ("sick days" instead of "PTO"). Would you lean toward RAG, fine-tuning, or both — and which part of your choice depends on embeddings doing their job well?

Knowledge Check

Your company wants their AI chatbot to answer questions using the latest product documentation that changes weekly. Should they use fine-tuning or RAG?

Correct! RAG is perfect here — it retrieves current documents at query time without needing to retrain the model every week.
Not quite. The key clue is that the documentation changes weekly. Fine-tuning would require retraining every time docs change. RAG retrieves current documents at query time — just update the knowledge base and you're done.

What are embeddings?

Correct! Embeddings convert text into numerical vectors where similar meanings are positioned close together in mathematical space. This is what powers semantic search in RAG systems.
Not quite. Embeddings are a way to represent meaning as numbers (vectors). Words with similar meanings — like "happy" and "joyful" — end up close together, while unrelated words are far apart. This powers the search in RAG.