Modern LLMs (like GPT-4 or Claude) use a three-stage training process. It is helpful to distinguish between pre-training (how the model learns language) and the two post-training stages, instruction tuning and RLHF (how it learns to be a helpful assistant).
1. Pre-training (Self-Supervised Learning)
This is where the model spends the vast majority of its training time and compute. It doesn't use human labels. Instead, the data is the label.
The Goal
Predict the next word (token).
The Process
We feed it trillions of tokens of text from the internet. At each position in the text, we hide the next token and the AI guesses it.
How it "Learns"
The AI compares its guess to the actual token. If it was wrong, it calculates the "error" and mathematically adjusts its weights (the internal numbers that shape its predictions) via gradient descent: it strengthens the connections that led toward the right token and weakens the ones that led away from it. Over trillions of repetitions, its predictions become remarkably accurate.
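To make the mechanism concrete, here is a minimal sketch of that guess-compare-adjust loop in PyTorch. Everything in it is illustrative (a toy five-word vocabulary and a single weight table standing in for a real network), not how production models are actually built.

```python
import torch
import torch.nn.functional as F

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary
stoi = {w: i for i, w in enumerate(vocab)}

# "The data is the label": each token's target is simply the next token.
tokens = torch.tensor([stoi[w] for w in ["the", "cat", "sat", "on", "the", "mat"]])
inputs, targets = tokens[:-1], tokens[1:]

# A bigram "model": one weight table mapping a token to next-token scores.
logits_table = torch.randn(len(vocab), len(vocab), requires_grad=True)
optimizer = torch.optim.SGD([logits_table], lr=0.1)

for step in range(100):
    logits = logits_table[inputs]            # the model's guesses
    loss = F.cross_entropy(logits, targets)  # the "error" vs. the actual tokens
    optimizer.zero_grad()
    loss.backward()                          # compute how each weight contributed
    optimizer.step()                         # strengthen/weaken connections
```

After a few dozen steps the table assigns high scores to the transitions that actually occur in the toy text, which is the same dynamic, at vastly smaller scale, as the pre-training described above.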
Result
It learns grammar, facts, and even coding patterns just by trying to be the world's best "autocomplete."
2. Instruction Tuning (Supervised Learning)
After the model knows how to speak, we teach it how to follow orders.
The Goal
Align the model to specific tasks (e.g., "Write a Python script").
The Process
Human developers write datasets of specific example pairs (sketched in code after this list):
Input: "Write a function to sort a list."
Label (The Answer): "def sort_list(my_list): ..."
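Here is a minimal sketch of what such a dataset and its prompt template might look like; the field names, template, and examples are illustrative assumptions, not any particular lab's format.

```python
# Illustrative instruction-tuning pairs; real datasets hold many thousands.
sft_dataset = [
    {
        "input": "Write a function to sort a list.",
        "label": "def sort_list(my_list):\n    return sorted(my_list)",
    },
    {
        "input": "Translate 'hello' to French.",
        "label": "Bonjour",
    },
]

# Training reuses the next-token objective from pre-training, but each example
# is rendered as prompt + answer, and the loss is typically computed only on
# the answer tokens, so the model learns to respond rather than to continue.
def format_example(example: dict) -> str:
    return f"### Instruction:\n{example['input']}\n\n### Response:\n{example['label']}"

print(format_example(sft_dataset[0]))
```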
Result
The model learns the difference between "continuing a sentence" and "answering a prompt."
3. RLHF (Reinforcement Learning from Human Feedback)
The final "polishing" stage.
The Process
The AI generates two different answers to the same prompt, and a human marks which one is better. Those preference judgments train a reward model (a scorer that predicts which answer a human would prefer), and reinforcement learning then tunes the LLM to produce answers the reward model rates highly.
Result
The model learns nuances like safety, tone, and which coding style is more "industry standard."
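Here is a minimal sketch of the pairwise ranking loss commonly used to train such a reward model, with stand-in scalar scores in place of a real network scoring whole responses.

```python
import torch
import torch.nn.functional as F

# Scores the reward model assigned to two answers to the same prompt.
reward_chosen = torch.tensor([1.3], requires_grad=True)    # human-preferred answer
reward_rejected = torch.tensor([0.4], requires_grad=True)  # the other answer

# Loss is small when the preferred answer outscores the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()  # gradients push the two scores further apart
print(f"ranking loss: {loss.item():.3f}")
```

Minimizing this loss drives the chosen score above the rejected one; across many thousands of comparisons, the reward model learns a general sense of what humans prefer.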