Large Language Models & Transformers
ChatGPT, Claude, Gemini — they're all built on the same breakthrough architecture. This lesson explains what large language models are, how they generate text, and the key terms you need to know.
After this lesson, you'll be able to:
- ✓ Explain how large language models generate text
- ✓ Describe the attention mechanism and why it matters
- ✓ Define key LLM vocabulary: tokens, context window, parameters, temperature
What Is a Large Language Model (LLM)?
A large language model is a type of deep learning model trained on massive amounts of text. It works by predicting the next token (a word or a piece of a word) in a sequence, and repeating that prediction over and over is what lets it generate human-like text.
When you ask an LLM a question, it doesn't look up the answer in a database. Instead, it generates a response one token at a time, each time predicting what text should come next based on everything before it.
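The one-token-at-a-time loop described above can be sketched in a few lines of Python. This is a toy illustration under heavy assumptions, not how any real model is implemented: the NEXT_TOKEN_PROBS table and the generate function are hypothetical, the probabilities are made up, and the toy looks only at the most recent token, whereas a real LLM conditions on everything before it and computes its probabilities with a neural network.

```python
# Hypothetical next-token probabilities, invented for illustration only.
# A real LLM computes these with a neural network over the whole context.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.7, "slept": 0.3},
    "mat": {"<end>": 1.0},
    "sat": {"<end>": 1.0},
    "slept": {"<end>": 1.0},
}


def generate(max_tokens=10):
    """Build a sequence one token at a time by always taking the likeliest next token."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # Simplification: look only at the last token, not the full history.
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the "<start>" marker


print(" ".join(generate()))  # prints: the cat sat
```

Nothing is looked up as a finished answer here: the sentence only exists once the loop has produced it, token by token. This sketch always picks the single most likely token; real systems usually sample from the probabilities instead, which is what the temperature setting adjusts.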
Key Insight
LLMs don't "understand" language the way humans do. They are very good at predicting what text should come next based on patterns in their training data. The result is text that is often remarkably coherent and useful — but it's pattern matching, not comprehension.
The Transformer Architecture
The Transformer is the architecture that makes modern LLMs possible. It was introduced in a 2017 research paper titled "Attention Is All You Need" — and it lived up to its name.
The Key Innovation: Attention
Before Transformers, the leading language models (recurrent neural networks) processed text one word at a time, in order, which made it hard to connect words that were far apart. The attention mechanism changed that: it lets the model look at all the words in its input at once and weigh how relevant each one is to every other.
Example
Consider the sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? A human instantly knows it means "the cat." The attention mechanism allows the model to make this same connection — it learns to attend to "cat" when processing "it," even though other nouns like "mat" are closer in the sentence.
[Figure: attention weights for the word "it" in the example sentence. Illustrative weights, not measured outputs. Real attention happens in many parallel "heads" simultaneously.]
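To make the "it" example concrete, here is a minimal sketch of how attention weights can be computed, using the scaled dot-product form (dot product, divide by the square root of the vector size, then softmax). Everything specific in it is an assumption made for illustration: the 2-D vectors are hand-picked so that "cat" comes out on top, the names attention_weights, keys, and query_it are hypothetical, and only three candidate words are scored. In a real model the vectors are learned, have hundreds or thousands of dimensions, and the resulting weights are then used to blend information from the attended words.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Score each key against the query (scaled dot product), then normalize."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hand-picked toy vectors (assumptions, not values from any real model).
words = ["cat", "mat", "tired"]
keys = [[1.0, 0.2], [0.1, 1.0], [0.6, 0.1]]
query_it = [2.0, 0.0]  # hypothetical query vector for the word "it"

weights = attention_weights(query_it, keys)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these toy vectors, "cat" receives the largest weight even though "mat" sits closer to "it" in the sentence: relevance is decided by how well the vectors match, not by position.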
Key Vocabulary
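These are the terms you'll hear in every conversation about LLMs:
- Tokens: the chunks of text a model reads and writes. A token can be a whole word, part of a word, or a punctuation mark.
- Context window: the maximum number of tokens the model can take into account at once, counting both your input and its response.
- Parameters: the numerical values the model learns during training. "Large" models have billions of them.
- Temperature: a setting that controls how random the output is. Lower values make responses more predictable; higher values make them more varied.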
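Of these terms, temperature is the easiest to see in action. The sketch below assumes the common approach of dividing the model's raw scores (often called logits) by the temperature before converting them to probabilities with a softmax. The scores and candidate words are made up, and softmax_with_temperature is a hypothetical helper name, not a function from any particular library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
candidates = ["mat", "sofa", "moon"]
logits = [2.0, 1.0, 0.0]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    summary = ", ".join(f"{c}={p:.2f}" for c, p in zip(candidates, probs))
    print(f"temperature {temperature}: {summary}")
```

At the low temperature most of the probability lands on the top-scoring token, so the output is more predictable. At the high temperature the probabilities move closer together, so less likely tokens get chosen more often.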
Who Makes What
The AI landscape moves fast, but here are the major players and their flagship models:
- OpenAI: GPT-4, ChatGPT
- Anthropic: Claude
- Google: Gemini
- Meta: Llama (open source)
Open source models (like Meta's Llama and the models from Mistral AI) release their model weights publicly. Anyone can download them and, within the terms of each model's license, use, modify, and build on them. This means companies can run these models on their own servers, customize them for specific tasks, and avoid sending data to a third party.
Closed models (like GPT-4 and Claude) are only accessible through an API or web interface. The model weights are proprietary. You send your input to the company's servers and get a response back.
The trade-off: open source gives you control and privacy, but closed models are often more capable (for now) and easier to use without infrastructure expertise. Many organizations use both, depending on the task.
Key Takeaway
LLMs generate text by predicting the next token, powered by the Transformer architecture's attention mechanism. They don't understand language — they're extremely good at pattern matching. Knowing the vocabulary (tokens, context window, parameters, temperature) lets you use and evaluate these tools more effectively.
Try This (Optional)
Think of an AI tool you've used in the last week. Find one moment where a response sounded confident but turned out to be wrong. What about the output (its tone, its specificity, its fluent prose) made it feel more authoritative than it should have? That mismatch is what fluent pattern matching without underlying comprehension looks like.
Knowledge Check
How does an LLM generate text?
Which statement is most accurate about how an LLM "knows" things?