Lesson 3: Large Language Models & Transformers

What Is a Large Language Model (LLM)?

A large language model is a type of deep learning model trained on massive amounts of text. It works by predicting the next word (or token) in a sequence, which allows it to generate human-like text.

When you ask an LLM a question, it doesn't look up the answer in a database. Instead, it generates a response one token at a time, each time predicting what text should come next based on everything before it.
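
The token-by-token loop described above can be sketched with a toy stand-in for the model. The `NEXT_TOKEN_PROBS` table and the `generate` function are hypothetical, invented for this illustration. A real LLM computes these probabilities with a neural network and bases each prediction on the whole preceding sequence, not only the last token.

```python
import random

# Hypothetical toy "model": for each token, the probabilities of the next token.
# A real LLM computes these with a neural network over a very large vocabulary.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Generate text one token at a time by sampling from the toy table."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]      # look only at the last token
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)                 # the new token becomes context
    return tokens[1:]                             # drop the "<start>" marker

print(" ".join(generate()))
```

Each pass through the loop adds exactly one token, and that token then influences the next prediction. Nothing is looked up in a database of answers.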

Key Insight

LLMs don't "understand" language the way humans do. They are very good at predicting what text should come next based on patterns in their training data. The result is text that is often remarkably coherent and useful — but it's pattern matching, not comprehension.

Analogy: Like autocomplete on your phone, but trained on a huge portion of the internet and capable of generating entire paragraphs, not just words. Your phone predicts the next word; an LLM predicts the next thousand.

The Transformer Architecture

The Transformer is the architecture that makes modern LLMs possible. It was introduced in the 2017 research paper "Attention Is All You Need" by researchers at Google, and it lived up to its name: attention is the core of the design.

The Key Innovation: Attention

Before Transformers, the dominant language models (recurrent neural networks) processed text one word at a time, in order, which made it hard to keep track of words seen much earlier. The attention mechanism changed that: it allows the model to look at all the words at once and figure out which ones are most relevant to each other.

Example

Consider the sentence: "The cat sat on the mat because it was tired."

What does "it" refer to? A human instantly knows it means "the cat." The attention mechanism allows the model to make this same connection — it learns to attend to "cat" when processing "it," even though other nouns like "mat" are closer in the sentence.

[Figure: attention weights from "it" to the other words in the sentence. Illustrative weights, not measured outputs; real attention happens in many parallel "heads" simultaneously.]

Analogy: Reading a whole paragraph and highlighting the important relationships between words, vs. reading one word at a time from left to right and trying to remember what came before.
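
To make the "it" example concrete, here is a minimal sketch of how attention weights can be computed for one word, using the scaled dot-product form described in the 2017 paper. The vectors are made up by hand so that "it" lines up with "cat". In a real model they are learned during training and have far more dimensions. The function names are assumptions for this illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Made-up 2-dimensional vectors, chosen by hand for illustration only.
words = ["cat", "sat", "mat", "tired"]
keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]

weights = attention_weights(query_for_it, keys)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these hand-picked vectors, "cat" receives the largest weight even though "mat" sits closer to "it" in the sentence. A full attention layer would then use the weights to blend a second set of vectors (the "values"), and would repeat the whole process in several heads at once.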

Key Vocabulary

These are the terms you'll hear in every conversation about LLMs. Understanding them puts you ahead of most people.

Token

The basic unit an LLM processes. For English text, a token averages roughly 4 characters or 0.75 words. Common short words like "the" or "cat" are usually one token; longer or rarer words split into pieces. For example, "tokenization" might become "token" + "ization," though the exact split depends on the model's tokenizer.
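
The two rules of thumb above can be turned into a quick estimate. This is only a sketch: `estimate_tokens` is a hypothetical helper, and real token counts come from running the model's own tokenizer, so expect the estimates to be off, especially for code or non-English text.

```python
def estimate_tokens(text):
    """Rough token estimates from the two rules of thumb (English text only)."""
    by_chars = len(text) / 4               # about 4 characters per token
    by_words = len(text.split()) / 0.75    # about 0.75 words per token
    return round(by_chars), round(by_words)

sentence = "The cat sat on the mat because it was tired."
chars_estimate, words_estimate = estimate_tokens(sentence)
print("estimate from characters:", chars_estimate)
print("estimate from words:", words_estimate)
```

The two estimates rarely agree exactly, which is a useful reminder that both are approximations.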

Context Window

The maximum amount of text (measured in tokens) a model can process at once. A larger context window means the model can "see" more of the conversation or document at the same time.
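
One way to picture the limit: if the text is longer than the window, something has to be left out. The sketch below keeps only the most recent tokens, which is one simple strategy assumed here for illustration and not necessarily what any particular product does. Whitespace-separated words stand in for real tokens, and `fit_to_context` is a hypothetical helper.

```python
def fit_to_context(tokens, context_window):
    """Keep only the most recent tokens that fit in the window."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return tokens[-context_window:]        # older tokens fall out of view

# Words stand in for tokens here; real models count tokens, not words.
conversation = "The cat sat on the mat because it was tired".split()
visible = fit_to_context(conversation, context_window=4)
print(visible)
```

With a window of 4, the model in this sketch would no longer "see" the word "cat" at all, so it could not work out what "it" refers to.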

Parameters

Internal values learned during training. Think of them as the knowledge stored in the network's connections. More parameters generally means a more capable model, although training data and methods matter too. GPT-4 is reportedly in the trillion-parameter range, though OpenAI has never officially disclosed the number; the largest version of Meta's open Llama 3.1 model has 405 billion.
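
To get a feel for what is being counted, consider a tiny fully connected network. Each connection between two layers has one weight, and each output has one bias; those are its parameters. The layer sizes below are hypothetical and far smaller than anything in a real LLM, and real Transformers contain other kinds of layers as well.

```python
def dense_layer_parameters(inputs, outputs):
    """One weight per input-output connection, plus one bias per output."""
    return inputs * outputs + outputs

# Hypothetical toy network: 4 inputs -> 8 hidden units -> 2 outputs.
layer_sizes = [4, 8, 2]
total = sum(
    dense_layer_parameters(a, b)
    for a, b in zip(layer_sizes, layer_sizes[1:])
)
print("total parameters:", total)
```

Scaling the same idea up to thousands of units per layer and many layers is how models reach billions of parameters.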

Temperature

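Controls how random or creative the model's output is by reshaping the probabilities of the next token. Low temperature (near 0) = focused, predictable responses, because the most likely token is almost always chosen. High temperature (1+) = more creative, varied, and sometimes surprising responses, because less likely tokens get a better chance.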
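
A common way to implement temperature is to divide the model's raw scores (often called logits) by the temperature before converting them to probabilities. The sketch below assumes that approach; the scores are made up, and `apply_temperature` is a hypothetical helper, not any vendor's API. Treating temperature 0 as "always pick the top token" is also an assumption, since dividing by zero is undefined.

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw scores (logits) to probabilities at a given temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means always pick the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    highest = max(scaled)                  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                   # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

As the temperature rises, the probabilities move closer together, so sampling picks the less likely tokens more often.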

Who Makes What

The AI landscape moves fast, but here are the major players and their flagship models:

OpenAI: GPT-4, ChatGPT

Anthropic: Claude

Google: Gemini
