TIL-05: Deep Learning 01

I'm working through the basics as a stepping stone toward studying Andrej Karpathy's nanochat.
Starting Point: Why Do We Need Machine Learning?
Let's start from the limitations of traditional programming.
The World We Know as Developers
When we write functions, we write the rules directly in code. Obviously.
```java
// Traditional programming: humans write the rules
int celsiusToFahrenheit(int celsius) {
    return celsius * 9 / 5 + 32;
}
```
Humans write the rules (code) → Computers process the data
Input → Apply rules → Output. Clear and predictable. But...
Problems Where You Can't Write the Rules
What if you had to write a function like this?
```java
// Can you hand-write the rules for this function?
String whatAnimalIsThis(Image photo) {
    // ??? How do you tell a cat from pixel values?
    return "cat"; // or "dog"
}
```
You could try rules like "pointy ears = cat," but some dogs have pointy ears too, and some cats have round ears. "Short fur?" "Has whiskers?" ... The rules get endlessly complex, and exceptions keep piling up.
The Core Idea of Machine Learning
Instead of humans writing the rules,
show the computer massive amounts of 'labeled example data' and let it figure out the rules on its own.
Humans provide the data → Computers discover the rules
- Example: 100,000 cat photos + their "cat" labels
- The quality of the learned rules depends on the quantity and quality of that data
The way computers "discover the rules" is through something called a Neural Network.
1. Neuron: The Smallest Unit
What Is a Neuron?
A neural network is made up of small units called neurons. What a single neuron does is surprisingly simple.
What a neuron does:
- Takes inputs
- Multiplies each by a weight
- Adds them all up
- Passes the result through an activation function
In pseudocode:
```java
// Everything a single neuron does
double neuron(double[] inputs, double[] weights, double bias) {
    double sum = bias;
    for (int i = 0; i < inputs.length; i++) {
        sum += inputs[i] * weights[i]; // input × weight
    }
    return activate(sum); // pass through activation function
}
```
Following the Numbers
Let's look at a concrete example. Assume an animal classification neuron takes 2 inputs:
(The original post has an interactive widget here: move sliders for the weights and bias to see how the neuron's output changes.)
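Since the interactive demo doesn't survive in text form, here's the same walkthrough as a runnable sketch. The inputs, weights, and bias are made-up values, and ReLU is used as the activation:

```java
public class NeuronDemo {
    // ReLU activation: negatives become 0, positives pass through
    static double relu(double x) {
        return Math.max(0.0, x);
    }

    // Same logic as the pseudocode above
    static double neuron(double[] inputs, double[] weights, double bias) {
        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return relu(sum);
    }

    public static void main(String[] args) {
        double[] inputs  = {0.9, 0.2};   // made-up feature values
        double[] weights = {0.8, -0.5};  // made-up weights
        double bias = 0.1;
        // sum = 0.1 + 0.9*0.8 + 0.2*(-0.5) = 0.72, and relu(0.72) = 0.72
        System.out.printf("output = %.2f%n", neuron(inputs, weights, bias));
    }
}
```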
Key Terms
| Term | Developer Analogy | Description |
|---|---|---|
| Weight | Config variable | A number representing how important each input is. Learned! |
| Bias | Default offset | The baseline value a neuron has even when all inputs are 0 |
| Activation function | Like an if-statement | Controls the output range. e.g., ReLU (negatives → 0), tanh (−1 to 1) |
| Parameter | Tuning dial | Collective term for weights + biases. What learning adjusts |
A single neuron's core operations are just multiplication, addition, and an activation function. But connect millions of them together and something remarkable happens. That's a neural network.
2. Neural Network: Combining Neurons
Stacking simple building blocks layer by layer to capture complex patterns
What Is a Layer?
A single neuron can only make very simple decisions like "if it's light, it's a cat." For complex decisions, you need to group neurons into layers and stack them.
Developer analogy: a pipeline
A neural network is like a data processing pipeline.
Input data → Layer 1 → Layer 2 → Layer 3 → Final output
Like a Spring Filter Chain, each layer transforms the data and passes it to the next.
The difference is that each filter's rules aren't hardcoded — they're **determined by learning**.
3 Types of Layers
| Layer | Role | Analogy |
|---|---|---|
| Input layer | Receives raw data | API Request Body |
| Hidden layer | Extracts/transforms patterns | Internal service logic (can stack multiple) |
| Output layer | Returns the final prediction | API Response Body |
Why Stack "Deep"? (What "Deep" Means)
The "deep" in "Deep Learning" means multiple hidden layers.
More layers = capturing more complex patterns:
- Layer 1: Very simple features ("this area is bright", "there's a straight line here")
- Layer 2: Slightly more complex features ("this is a round shape", "this is striped")
- Layer 3: High-level features ("this is an eye", "this is an ear")
- Layer 4: Final decision ("this is a cat")
Each layer combines the previous layer's results to build increasingly abstract concepts.
Parameter Count = Neurons × Connections
Let's do a quick calculation:
```
// A very small neural network
Input layer:  3 neurons
Hidden layer: 4 neurons
Output layer: 2 neurons

// Parameter count
Input → Hidden:  3 × 4 = 12 weights + 4 biases = 16
Hidden → Output: 4 × 2 =  8 weights + 2 biases = 10
────────────────────────
Total parameters: 26

// For comparison, GPT-2 (nanochat d26): ~568 million
```
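The calculation above generalizes to any layer sizes. `countParams` is a hypothetical helper, but the formula (weights plus biases per layer transition) is exactly the one used above:

```java
public class ParamCount {
    // Parameters in a fully connected network, given the size of each layer
    static int countParams(int[] layerSizes) {
        int total = 0;
        for (int i = 0; i < layerSizes.length - 1; i++) {
            // weights: every neuron in layer i connects to every neuron in layer i+1,
            // plus one bias per neuron in layer i+1
            total += layerSizes[i] * layerSizes[i + 1] + layerSizes[i + 1];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countParams(new int[]{3, 4, 2})); // prints 26
    }
}
```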
3. Forward Pass: From Input to Output
How data flows through a neural network
What Is a Forward Pass?
You feed input data into the neural network and pass it through layer by layer until you get the final output (prediction). This process is called a Forward Pass — literally "passing straight forward."
Developer analogy:
Just like an HTTP request flows through Controller → Service → Repository to produce a response,
data flows through Input Layer → Hidden Layers → Output Layer to produce a prediction.
Following the Numbers: A Mini Neural Network
Let's compute a Forward Pass from start to finish with a tiny neural network.
Problem setup:
Goal: A neural network that takes 2 test scores and predicts pass/fail
Structure: 2 inputs → 2 hidden neurons → 1 output
Activation function: ReLU (negatives become 0, positives stay as-is)
(The original post has an interactive widget here: change the two input scores and watch the full forward pass, from input normalization (x₁ = 0.80, x₂ = 0.70) through the hidden layer to the output layer.)
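As a stand-in for the interactive demo, here's a minimal sketch of the 2 → 2 → 1 forward pass. The weight values are made up (the text below notes they're invented numbers like 0.3 and −0.1), and the inputs are the normalized scores x₁ = 0.80, x₂ = 0.70:

```java
public class ForwardPass {
    static double relu(double x) { return Math.max(0.0, x); }

    // 2 inputs → 2 hidden neurons (ReLU) → 1 output; all weights are made up
    static double predict(double x1, double x2) {
        // Hidden layer: each neuron has 2 weights and a bias
        double h1 = relu(0.3 * x1 + (-0.1) * x2 + 0.1);  // relu(0.27) = 0.27
        double h2 = relu(0.5 * x1 + 0.2 * x2 + (-0.2));  // relu(0.34) = 0.34
        // Output layer: 1 neuron combining the hidden outputs
        return 0.7 * h1 + 0.4 * h2 + 0.05;
    }

    public static void main(String[] args) {
        System.out.printf("prediction = %.3f%n", predict(0.80, 0.70)); // ≈ 0.375
    }
}
```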
A Forward Pass is pure math — a chain of multiplications and additions. Input numbers get transformed as they pass through layers, producing the final prediction.
But the weights in the example above (0.3, −0.1, etc.) are values I made up. Adjusting these weights to the "correct values" is exactly what learning is. And the core algorithm behind learning is Backpropagation.
4. Loss: How Wrong Are We?
Before understanding Backpropagation, we need to measure "how wrong"
Why Do We Need Loss?
The neural network made a prediction. Now we need to measure "how accurate it was" as a number. This number is called Loss.
Developer analogy:
In test code, when assertEquals(expected, actual) fails, you get an error message.
Loss is that "how different are they" expressed as a single number.
Loss = 0 → Perfect match (test passed)
Loss = large number → Way off (test failed)
The Simplest Loss: Squared Difference
Loss = (prediction − answer)²
(Squared Error)
Why square instead of just subtracting?
| Prediction | Answer | Difference (pred − answer) | Squared (Loss) |
|---|---|---|---|
| 0.8 | 1.0 | −0.2 | 0.04 |
| 0.3 | 1.0 | −0.7 | 0.49 |
| 1.2 | 1.0 | +0.2 | 0.04 |
| −0.5 | 1.0 | −1.5 | 2.25 |
Squaring solves two things:
- ① Always positive regardless of direction (+/−)
- ② The penalty grows dramatically for larger errors. A 0.2 difference gives Loss 0.04, but a 1.5 difference gives Loss 2.25 — it scales up fast.
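The loss formula in code, checked against two rows of the table:

```java
public class LossDemo {
    // Loss = (prediction − answer)²
    static double squaredError(double prediction, double target) {
        double diff = prediction - target;
        return diff * diff; // always >= 0, and grows quadratically with the error
    }

    public static void main(String[] args) {
        System.out.printf("small miss: %.2f%n", squaredError(0.8, 1.0));  // ≈ 0.04
        System.out.printf("big miss:   %.2f%n", squaredError(-0.5, 1.0)); // 2.25
    }
}
```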
The Key Point: The Sole Goal of Learning Is to Minimize Loss
Loss decreasing = predictions getting closer to the answer = the model is learning.
The val_bpb (validation bits per byte) monitored in nanochat is a type of Loss.
Alright, now we're ready.
Here's what we know so far:
- Forward Pass computes the prediction
- Loss measures "how wrong it is"
- The remaining question: "How should we adjust each weight to reduce the Loss?"
The algorithm that answers this question is Backpropagation.
5. Backpropagation: Propagating the Error Backward
The algorithm that figures out "which direction to adjust each weight" to reduce Loss
THE QUESTION BACKPROPAGATION ANSWERS
If we change weight w by a tiny amount, how much does Loss change, and in which direction?
Once we know this, we can adjust the weight in the direction that reduces Loss. This "how much and in which direction" is called the gradient.
Analogy: Walking Down a Mountain Blindfolded
Imagine this. You're standing on a mountain with your eyes closed. You need to get to the lowest point (Loss = 0).
Strategy for descending the mountain = GRADIENT DESCENT
1. Feel the slope with your feet → "This way is downhill" (= compute gradient)
2. Take a step downhill (= update weights)
3. Feel the slope again, take another step (= repeat)
4. Nowhere lower to go → You've arrived! (Loss minimized)
"Feeling the slope with your feet" = Backpropagation
"Walking downhill" = Gradient Descent
What Is a Gradient?
Mathematically, a gradient is a "derivative," but we can understand it intuitively: for each weight, it tells us both the direction (does increasing the weight raise or lower the Loss?) and the magnitude (by how much?).
Chain Rule: The Core Principle of Backpropagation
In a real neural network, weight w passes through multiple operations before affecting the final Loss. It's like dominoes.
w → (w × x) → (sum) → (activation) → (next layer) → ... → Loss
To find "how much Loss changes when w changes," you multiply the effect of each step in this domino chain on the next step. That's the Chain Rule.
Developer analogy: Error tracing
An error occurred in production (= Loss is high)
You trace the stack trace to find the root cause.
Error in server logs (output)
← caused by: Service Layer (hidden layer)
← caused by: wrong DB query (weight)
Backpropagation works the same way. Starting from the Loss (error), it traces backward (Back) layer by layer to figure out "who's most responsible (gradient)."
That's why it's called Back-propagation.
A Tiny Backpropagation
Let's walk through the entire process with numbers using a very simple example.
Problem setup:
1 neuron with input x = 2.
Prediction = w * x + b, target = 10, Loss = (prediction − target)²
Things start getting a bit harder from here... but let's push through!
Here's how I understand the Chain Rule:
[Goal] We want to reduce Loss
↑
[Question] How much does Loss change if we nudge w? = dLoss/dw
↑
[Decomposition] w doesn't change Loss directly — it goes through a chain:
Nudging w → changes prediction → changed prediction → changes Loss
```
dLoss       dLoss        dPrediction
─────  =  ───────────  ×  ─────────
  dw      dPrediction         dw
```
What I ultimately want is dLoss/dw, but since it's hard to compute directly,
I compute the two intermediate pieces first and multiply them.
- dLoss/dPrediction
- dPrediction/dw
So to get dLoss/dw, we need dLoss/dPrediction and dPrediction/dw. Let's compute each one.
The scenario we'll use:
w = 3.75, b = 3.50
1) Forward pass
Prediction = 3.75 * 2 + 3.50 = 11.00
2) Loss calculation
Loss = (11.00 − 10.00)² = 1.00
Intermediate Step 1: dLoss/dPrediction
dLoss/dPrediction = how much does Loss change when the prediction value changes?
Let's experiment by nudging the prediction slightly. Current prediction is 11.00, target is 10.
Experiment 1: Nudge by 0.05
- Original prediction: 11.00 → Loss = (11.00 − 10)² = 1.0000
- New prediction: 11.05 → Loss = (11.05 − 10)² = 1.1025
- Change in Loss: 0.1025
- Rate of change: 0.1025 / 0.05 = 2.05
Experiment 2: Nudge by a smaller amount, 0.001
- Original prediction: 11.00 → Loss = 1.000000
- New prediction: 11.001 → Loss = (11.001 − 10)² = 1.002001
- Change in Loss: 0.002001
- Rate of change: 0.002001 / 0.001 = 2.001
Notice how the rate of change converges to 2.0 as the interval gets smaller?
This "rate of change when you shrink the interval to its limit" is the derivative, and that value is the gradient. In this case, dLoss/dPrediction = 2.0.
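This limiting process is easy to verify in code. `numericalRate` is a hypothetical helper that computes (Loss(p + h) − Loss(p)) / h for a given nudge h:

```java
public class GradientCheck {
    static double loss(double prediction, double target) {
        double d = prediction - target;
        return d * d;
    }

    // Finite-difference rate of change: (Loss(p + h) − Loss(p)) / h
    static double numericalRate(double p, double target, double h) {
        return (loss(p + h, target) - loss(p, target)) / h;
    }

    public static void main(String[] args) {
        double p = 11.0, target = 10.0;
        for (double h : new double[]{0.05, 0.001, 0.000001}) {
            System.out.printf("h=%-10.6f rate=%.6f%n", h, numericalRate(p, target, h));
        }
        // the rates approach the analytic derivative 2 * (p − target) = 2.0 as h shrinks
    }
}
```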
Intermediate Step 2: dPrediction/dw
dPrediction/dw = how much does the prediction change when weight w changes?
This one is much simpler. Since prediction = w × 2 + b, let's nudge w.
- w = 3.75 → prediction = 3.75 × 2 + 3.50 = 11.00
- w = 3.76 → prediction = 3.76 × 2 + 3.50 = 11.02
- Change in prediction: 0.02
- Rate of change: 0.02 / 0.01 = 2.0
Nudging w by 0.01 changed the prediction by 0.02. No matter how much you nudge w, the ratio is always 2.0. That's because in prediction = w × 2 + b, the coefficient on w is x = 2.
In other words, dPrediction/dw = x = 2.0
Final Calculation: dLoss/dw
Now let's combine the two pieces.
```
dLoss       dLoss        dPrediction
─────  =  ───────────  ×  ─────────
  dw      dPrediction         dw
```
dLoss/dw = 2.0 × 2.0 = 4.0
This means: near the current values, increasing w raises the Loss at a rate of 4 per unit, and decreasing w lowers it at the same rate.
Rather than taking a big jump, we take a small step: scale the gradient by a small factor, here 0.01, called the learning rate (lr).
So the new weight (w) is:
- New w = old w − lr(0.01) × dLoss/dw
- New w = 3.75 − 0.01 × 4.0
- New w = 3.71
One More Calculation: Bias b
We're not done yet. We need to compute the same thing for bias b.
Same approach as dPrediction/dw. Since prediction = w × 2 + b, the coefficient on b is 1:
- Nudge b by 0.01 → prediction changes by exactly 0.01
- Rate of change: always 1.0
- dPrediction/db = 1.0
And dLoss/db = dLoss/dPrediction (reusing the value from earlier) × dPrediction/db = 2.0 × 1.0 = 2.0
So the new b is:
- New b = old b − lr(0.01) × dLoss/db
- New b = 3.50 − 0.01 × 2.0
- New b = 3.48
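The entire worked example, forward pass through parameter update, fits in a few lines. `step` is a hypothetical helper that performs one training step and returns the updated (w, b):

```java
public class BackpropStep {
    // One training step for prediction = w*x + b with squared-error loss
    static double[] step(double w, double b, double x, double target, double lr) {
        double prediction = w * x + b;                  // forward pass: 11.00
        double dLoss_dPred = 2 * (prediction - target); // dLoss/dPrediction = 2.0
        double dLoss_dw = dLoss_dPred * x;              // chain rule: × dPrediction/dw = 4.0
        double dLoss_db = dLoss_dPred * 1.0;            // chain rule: × dPrediction/db = 2.0
        // update: step against each gradient, scaled by the learning rate
        return new double[]{ w - lr * dLoss_dw, b - lr * dLoss_db };
    }

    public static void main(String[] args) {
        double[] updated = step(3.75, 3.50, 2.0, 10.0, 0.01);
        System.out.printf("new w = %.2f, new b = %.2f%n", updated[0], updated[1]); // 3.71, 3.48
    }
}
```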
Backpropagation Summary
Why do we compute the gradient for w and b separately?
- Because they affect the prediction (Forward Pass) to different degrees.
- w gets multiplied by input x before being added to the prediction, while b is added directly.
- In other words, w's influence on the prediction scales with x, while b always contributes exactly 1.
What's the benefit of using the Chain Rule?
- It breaks down something hard to compute all at once into easy pieces.
- Computing dLoss/dw directly is complex, but computing dLoss/dPrediction and dPrediction/dw separately then multiplying is straightforward.
- The advantage becomes clear at neural network scale:
- Even with 26 layers, each layer's gradient is just a simple multiplication.
- Multiply layer by layer from back to front, and you get the full gradient.
The Core Idea of Backpropagation
- Applying the Chain Rule "from back to front"
- Computational efficiency: starting from the output and going backward lets you reuse intermediate gradients.
  - Example: compute dLoss/dPrediction once, reuse it for both dLoss/dw and dLoss/db
  - Going front to back instead? You'd have to recompute from scratch for every parameter.
  - With 500 million parameters, the difference in computation is enormous.
6. The Training Loop
Forward Pass + Loss + Backpropagation together = "learning"
Training Loop = It's All Repetition
Putting everything we've learned together, neural network training is a surprisingly simple loop:
1. Forward Pass
: Feed data in and compute the prediction
2. Loss Calculation
: Measure how far the prediction is from the answer
3. Backward Pass (Backpropagation)
: Trace back from the Loss to compute each parameter's gradient
4. Parameter Update
: Nudge each parameter in the opposite direction of its gradient
5. Go back to step 1 and repeat
: Repeat tens of thousands to billions of times, and Loss gradually decreases
In pseudocode:
```
for (step = 0; step < 1_000_000; step++) {
    // 1. Forward: predict
    prediction = model.forward(input_data);

    // 2. Loss: how wrong?
    loss = (prediction - target) ** 2;

    // 3. Backward: compute each parameter's gradient
    gradients = backpropagate(loss);

    // 4. Update: adjust in the opposite direction of the gradient
    for each param in model.parameters:
        param -= learning_rate * gradients[param];
}
```
Getting a Sense of Real Scale
| | Pseudocode above | nanochat GPT-2 |
|---|---|---|
| Parameters | A few dozen | ~568 million |
| Training data | A handful | ~4.5 billion tokens (web text) |
| Iterations | Thousands | Tens of thousands of steps (each step = 500K tokens) |
| Time | 1 second | ~1.65 hours (8×H100 GPUs) |
| Cost | $0 | ~$48 |
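To make the pseudocode loop concrete, here's a sketch that trains the single-neuron example from section 5 (prediction = w·x + b, target 10). `train` is a hypothetical helper that runs the loop and returns the final loss:

```java
public class TrainingLoop {
    // Train prediction = w*x + b toward the target; returns the final loss
    static double train(int steps, double lr) {
        double x = 2.0, target = 10.0;
        double w = 3.75, b = 3.50; // same starting values as section 5
        double loss = 0;
        for (int step = 0; step < steps; step++) {
            double prediction = w * x + b;                  // 1. Forward: predict
            loss = Math.pow(prediction - target, 2);        // 2. Loss: how wrong?
            double dLoss_dPred = 2 * (prediction - target); // 3. Backward: gradients
            double dLoss_dw = dLoss_dPred * x;
            double dLoss_db = dLoss_dPred;
            w -= lr * dLoss_dw;                             // 4. Update: step against
            b -= lr * dLoss_db;                             //    the gradient
        }
        return loss;
    }

    public static void main(String[] args) {
        // loss shrinks toward 0 as w*x + b approaches the target
        System.out.printf("loss after 1000 steps: %.8f%n", train(1000, 0.01));
    }
}
```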
7. Connecting to LLMs and nanochat
How everything we've learned applies to real LLMs
An LLM = A Massive Neural Network Optimized for "Next Word Prediction"
Every concept we've covered applies directly to LLMs. The only thing that changes is "what it's predicting."
LLM training loop:
1. Forward: "The sky is" → pass through neural network → predict probability distribution for next word
2. Loss: Compare with the correct answer "blue." If "blue" had low probability, Loss is high
3. Backward: Compute gradients for all 568 million parameters
4. Update: Adjust parameters so "blue" gets higher probability
5. Repeat: Do this for 4.5 billion tokens
Do this to completion → you get a neural network that "understands English."
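Step 2's "low probability for the right word → high Loss" is cross-entropy in action. Here's a toy sketch with an invented three-word vocabulary and invented scores; real models score tens of thousands of tokens:

```java
public class NextTokenLoss {
    // Softmax: turns raw scores (logits) into probabilities that sum to 1
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double sum = 0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(logits[i] - max); // subtract max for numerical stability
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    public static void main(String[] args) {
        String[] vocab = {"blue", "green", "falling"}; // invented mini vocabulary
        double[] logits = {2.0, 0.5, -1.0};            // invented scores for "The sky is ..."
        double[] probs = softmax(logits);

        int answer = 0; // correct next word: "blue"
        double loss = -Math.log(probs[answer]); // cross-entropy: small when "blue" is likely
        System.out.printf("p(%s) = %.3f, loss = %.3f%n", vocab[answer], probs[answer], loss);
    }
}
```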
Mapping to nanochat Code
| Concept | nanochat file | Role |
|---|---|---|
| Network architecture | gpt.py | Transformer model definition (neurons, layer structure) |
| Forward Pass | gpt.py's forward() | Input tokens → next token probabilities |
| Loss calculation | base_train.py | Cross-entropy loss (for classification) |
| Backpropagation | PyTorch's .backward() | Automatic gradient computation (autograd) |
| Parameter update | optim.py | AdamW + Muon optimizer |
| Training data | dataloader.py | ClimbMix data loading |
| Inference (generation) | engine.py | Generate tokens one at a time with the trained model |
I'm planning to build a model myself, hands-on, one piece at a time, using Andrej Karpathy's nanochat modules directly.
Going forward, I'll be analyzing nanochat's code one piece at a time, running it myself, and studying whatever additional concepts come up along the way. I'll share everything through this blog.
