Modern LLMs (like GPT-4 or Claude) use a three-stage training process. It is helpful to distinguish between pre-training (how the model learns language) and the two post-training stages, instruction tuning and RLHF (how it learns to be a helpful assistant).
1. Pre-training (Self-Supervised Learning)
This is where the model spends the vast majority of its training time and compute. It doesn't use human labels. Instead, the data is the label.
The Goal
Predict the next word (token).
The Process
We feed it trillions of tokens of text from the internet. At each position in the text, we hide the next token and the AI guesses it.
How it "Learns"
The AI compares its guess to the actual token. If it was wrong, it calculates the "error" and mathematically adjusts its weights (the internal numbers that shape its predictions) via gradient descent: it strengthens the connections that led toward the right token and weakens the ones that led away from it. Over trillions of repetitions, its predictions become remarkably accurate.
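To make the mechanism concrete, here is a minimal sketch of that guess-compare-adjust loop in PyTorch. Everything in it is illustrative (a toy five-word vocabulary and a single weight table standing in for a real network), not how production models are actually built.

```python
import torch
import torch.nn.functional as F

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary
stoi = {w: i for i, w in enumerate(vocab)}

# "The data is the label": each token's target is simply the next token.
tokens = torch.tensor([stoi[w] for w in ["the", "cat", "sat", "on", "the", "mat"]])
inputs, targets = tokens[:-1], tokens[1:]

# A bigram "model": one weight table mapping a token to next-token scores.
logits_table = torch.randn(len(vocab), len(vocab), requires_grad=True)
optimizer = torch.optim.SGD([logits_table], lr=0.1)

for step in range(100):
    logits = logits_table[inputs]            # the model's guesses
    loss = F.cross_entropy(logits, targets)  # the "error" vs. the actual tokens
    optimizer.zero_grad()
    loss.backward()                          # compute how each weight contributed
    optimizer.step()                         # strengthen/weaken connections
```

After a few dozen steps the table assigns high scores to the transitions that actually occur in the toy text, which is the same dynamic, at vastly smaller scale, as the pre-training described above.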
Result
It learns grammar, facts, and even coding patterns just by trying to be the world's best "autocomplete."
2. Instruction Tuning (Supervised Learning)
After the model knows how to speak, we teach it how to follow orders.
The Goal
Align the model to specific tasks (e.g., "Write a Python script").
The Process
Human developers write datasets of specific example pairs (sketched in code after this list):
Input: "Write a function to sort a list."
Label (The Answer): "def sort_list(my_list): ..."
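Here is a minimal sketch of what such a dataset and its prompt template might look like; the field names, template, and examples are illustrative assumptions, not any particular lab's format.

```python
# Illustrative instruction-tuning pairs; real datasets hold many thousands.
sft_dataset = [
    {
        "input": "Write a function to sort a list.",
        "label": "def sort_list(my_list):\n    return sorted(my_list)",
    },
    {
        "input": "Translate 'hello' to French.",
        "label": "Bonjour",
    },
]

# Training reuses the next-token objective from pre-training, but each example
# is rendered as prompt + answer, and the loss is typically computed only on
# the answer tokens, so the model learns to respond rather than to continue.
def format_example(example: dict) -> str:
    return f"### Instruction:\n{example['input']}\n\n### Response:\n{example['label']}"

print(format_example(sft_dataset[0]))
```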
Result
The model learns the difference between "continuing a sentence" and "answering a prompt."
3. RLHF (Reinforcement Learning from Human Feedback)
The final "polishing" stage.
The Process
The AI generates two different answers to the same prompt, and a human marks which one is better. Those preference judgments train a reward model (a scorer that predicts which answer a human would prefer), and reinforcement learning then tunes the LLM to produce answers the reward model rates highly.
Result
The model learns nuances like safety, tone, and which coding style is more "industry standard."
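Here is a minimal sketch of the pairwise ranking loss commonly used to train such a reward model, with stand-in scalar scores in place of a real network scoring whole responses.

```python
import torch
import torch.nn.functional as F

# Scores the reward model assigned to two answers to the same prompt.
reward_chosen = torch.tensor([1.3], requires_grad=True)    # human-preferred answer
reward_rejected = torch.tensor([0.4], requires_grad=True)  # the other answer

# Loss is small when the preferred answer outscores the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()  # gradients push the two scores further apart
print(f"ranking loss: {loss.item():.3f}")
```

Minimizing this loss drives the chosen score above the rejected one; across many thousands of comparisons, the reward model learns a general sense of what humans prefer.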