Lesson 3

Large Language Models & Transformers

ChatGPT, Claude, Gemini — they're all built on the same breakthrough architecture. This lesson explains what large language models are, how they generate text, and the key terms you need to know.

After this lesson, you'll be able to:

  • Explain how large language models generate text
  • Describe the attention mechanism and why it matters
  • Define key LLM vocabulary: tokens, context window, parameters, temperature

What Is a Large Language Model (LLM)?

A large language model is a type of deep learning model trained on massive amounts of text. It works by predicting the next word (or token) in a sequence, which allows it to generate human-like text.

When you ask an LLM a question, it doesn't look up the answer in a database. Instead, it generates a response one token at a time, each time predicting what text should come next based on everything before it.
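To make "one token at a time" concrete, here is a minimal sketch of a generation loop. It is a toy, not how a real LLM works inside: it counts which word follows which in a tiny hand-written corpus and only looks at the previous word, whereas a real LLM uses a neural network and conditions on everything before it. The names `CORPUS`, `train_bigram_counts`, and `generate` are made up for this illustration.

```python
import random
from collections import Counter, defaultdict

# Toy "training data". A real LLM learns from vastly more text and uses a
# neural network rather than a table of counts.
CORPUS = "the cat sat on the mat . the cat was tired . the dog sat on the rug ."


def train_bigram_counts(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, prompt, max_new_tokens=8, seed=0):
    """Extend the prompt one token at a time by sampling a likely next token."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        options = counts.get(tokens[-1])
        if not options:  # no known continuation: stop early
            break
        candidates = list(options.keys())
        weights = list(options.values())
        # Sample the next token in proportion to how often it followed before.
        tokens.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(tokens)


counts = train_bigram_counts(CORPUS)
print(generate(counts, "the cat"))
```

Notice that nothing is looked up as a finished answer. Each new word is sampled from the continuations seen in training, appended to the text, and then used to predict the word after it.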

Key Insight

LLMs don't "understand" language the way humans do. They are very good at predicting what text should come next based on patterns in their training data. The result is text that is often remarkably coherent and useful — but it's pattern matching, not comprehension.

Analogy: An LLM is like the autocomplete on your phone, but trained on a huge portion of the internet and capable of generating entire paragraphs, not just single words. Your phone suggests the next word; an LLM keeps predicting, token after token, until it has written a full response.

The Transformer Architecture

The Transformer is the architecture that makes modern LLMs possible. It was introduced in a 2017 research paper titled "Attention Is All You Need" — and it lived up to its name.

The Key Innovation: Attention

Before Transformers, the dominant language models (recurrent neural networks) processed text one word at a time, in order. The Transformer is built around the attention mechanism instead: it lets the model look at all the words at once and work out which ones are most relevant to each other.

Example

Consider the sentence: "The cat sat on the mat because it was tired."

What does "it" refer to? A human instantly knows it means "the cat." The attention mechanism allows the model to make this same connection — it learns to attend to "cat" when processing "it," even though other nouns like "mat" are closer in the sentence.

Figure: Attention weights for the sentence "The cat sat on the mat because it was tired." The word "it" attends most strongly to "cat," with weaker attention to "mat" and "tired."

Illustrative weights, not measured outputs — real attention happens in many parallel "heads" simultaneously.

Analogy: Attention is like reading a whole paragraph and highlighting the important relationships between words, rather than reading one word at a time from left to right and trying to remember what came before.
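For readers who want to see the arithmetic, the sketch below shows one common form of attention, scaled dot-product attention, for a single word. The 2-dimensional vectors are hypothetical and chosen by hand so that "it" lines up with "cat". A real model learns its vectors during training, uses far more dimensions, and runs many heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product scores between one query and every key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical vectors, picked by hand for illustration only.
words = ["cat", "mat", "tired"]
keys = [[2.0, 0.0], [1.0, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
query_for_it = [2.0, 0.0]  # points in the same direction as the key for "cat"

for word, weight in zip(words, attention_weights(query_for_it, keys)):
    print(f"{word}: {weight:.2f}")
print("blended value for 'it':", attend(query_for_it, keys, values))
```

The weights always sum to 1, and the word whose key best matches the query gets the largest share. With these hand-picked vectors that word is "cat," followed by "mat" and then "tired."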

Key Vocabulary

These are the terms you'll hear in every conversation about LLMs. Understanding them puts you ahead of most people.

Token
The basic unit an LLM processes. In English text, a token averages roughly 4 characters, or about 0.75 words. Common short words like "the" or "cat" are usually one token; longer or rarer words split into pieces. For example, "tokenization" might become "token" + "ization," though the exact split depends on the model's tokenizer.
Context Window
The maximum amount of text (measured in tokens) a model can process at once. A larger context window means the model can "see" more of the conversation or document at the same time.
Parameters
Internal values learned during training. Think of them as the knowledge stored in the network's connections. More parameters generally mean a more capable model. GPT-4 is reportedly in the trillion-parameter range, though OpenAI has never officially disclosed the number; the largest version of Meta's openly released Llama 3.1 has 405 billion.
Temperature
Controls how random or creative the model's output is. Low temperature (at or near 0) = focused, predictable responses. High temperature (1 or above) = more creative, varied, and sometimes surprising responses.
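The vocabulary above can be tried out with a little arithmetic. The sketch below is illustrative only: `estimate_tokens` uses the 4-characters rule of thumb rather than a real tokenizer, `fits_in_context` compares that estimate to a context window size you supply, and `apply_temperature` shows the usual way temperature rescales a model's raw scores (often called logits) before they become probabilities. All function names and the example scores are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb only, not an exact tokenizer


def estimate_tokens(text):
    """Rough token count using the ~4 characters per token heuristic."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, context_window):
    """Check whether the estimated token count fits in a context window."""
    return estimate_tokens(text) <= context_window


def apply_temperature(logits, temperature):
    """Convert raw scores (logits) into probabilities at a given temperature."""
    if temperature <= 0:
        # Treat 0 as "always pick the top-scoring token" (greedy decoding).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


print(estimate_tokens("Large language models predict the next token."))

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

As the temperature rises, the probabilities move closer together, so lower-scoring tokens get picked more often. That is where the extra variety, and the occasional surprise, comes from.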

Who Makes What

The AI landscape moves fast, but here are the major players and their flagship models:

  • OpenAI: GPT-4, ChatGPT
  • Anthropic: Claude
  • Google: Gemini
  • Meta: Llama (open source)

Open source models (like Meta's Llama and the models from Mistral) release their model weights publicly. Anyone can download, use, modify, and build on them, within the terms of each model's license. This means companies can run these models on their own servers, customize them for specific tasks, and avoid sending data to a third party.

Closed models (like GPT-4 and Claude) are only accessible through an API or web interface. The model weights are proprietary. You send your input to the company's servers and get a response back.

The trade-off: open source gives you control and privacy, but closed models are often more capable (for now) and easier to use without infrastructure expertise. Many organizations use both, depending on the task.

Key Takeaway

LLMs generate text by predicting the next token, powered by the Transformer architecture's attention mechanism. They don't understand language — they're extremely good at pattern matching. Knowing the vocabulary (tokens, context window, parameters, temperature) lets you use and evaluate these tools more effectively.

Try This (Optional)

Think of an AI tool you've used in the last week. Find one moment where a response sounded confident but turned out to be wrong. What about the output (its tone, specificity, or fluent prose) made it feel more authoritative than it should have? That mismatch is what pattern matching without underlying comprehension looks like.

Knowledge Check

How does an LLM generate text?

Correct! LLMs generate text by predicting the next token, one at a time, based on patterns learned during training. There's no database lookup involved.
Not quite. LLMs don't store or look up answers. They predict the most likely next token based on patterns learned from their training data, generating text one piece at a time.

Which statement is most accurate about how an LLM "knows" things?

Correct! LLMs are very good at predicting text patterns, but they don't have understanding or consciousness. They produce text that sounds right based on statistical patterns — impressive, but fundamentally different from human comprehension.
Not quite. LLMs don't understand or look anything up. They recognize statistical patterns in text and predict what's likely to come next. The output sounds right because the patterns fit, not because the model comprehends.