TIL-05: Deep Learning 01

I'm working through the basics as a stepping stone toward studying Andrej Karpathy's nanochat.
Starting Point: Why Do We Need Machine Learning?
Let's start from the limitations of traditional programming.
The World We Know as Developers
When we write functions, we write the rules directly in code. Obviously.
```java
// Traditional programming: humans write the rules
int celsiusToFahrenheit(int celsius) {
    return celsius * 9 / 5 + 32;
}
```
Humans write the rules (code) → Computers process the data
Input → Apply rules → Output. Clear and predictable. But...
Problems Where You Can't Write the Rules
What if you had to write a function like this?
```java
// Can you hand-write the rules for this function?
String whatAnimalIsThis(Image photo) {
    // ??? How do you tell a cat from pixel values?
    return "cat"; // or "dog"
}
```
You could try rules like "pointy ears = cat," but some dogs have pointy ears too, and some cats have round ears. "Short fur?" "Has whiskers?" ... The rules get endlessly complex, and exceptions keep piling up.
The Core Idea of Machine Learning
Instead of humans writing the rules,
show the computer massive amounts of 'labeled example data' and let it figure out the rules on its own.
Humans provide the data → Computers discover the rules
- Example: 100,000 cat photos + their "cat" labels
- The quality of the learned rules depends on the quantity and quality of that data
The way computers "discover the rules" is through something called a Neural Network.
1. Neuron: The Smallest Unit
What Is a Neuron?
A neural network is made up of small units called neurons. What a single neuron does is surprisingly simple.
What a neuron does:
- Takes inputs
- Multiplies each by a weight
- Adds them all up
- Passes the result through an activation function
In pseudocode:
```java
// Everything a single neuron does
double neuron(double[] inputs, double[] weights, double bias) {
    double sum = bias;
    for (int i = 0; i < inputs.length; i++) {
        sum += inputs[i] * weights[i]; // input × weight
    }
    return activate(sum); // pass through activation function
}
```
Following the Numbers
Let's look at a concrete example. Assume an animal classification neuron takes 2 inputs:
(The original post has an interactive widget here: move sliders for the weights and bias to see how the neuron's output changes.)
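Since the interactive demo doesn't survive in text form, here's the same walkthrough as a runnable sketch. The inputs, weights, and bias are made-up values, and ReLU is used as the activation:

```java
public class NeuronDemo {
    // ReLU activation: negatives become 0, positives pass through
    static double relu(double x) {
        return Math.max(0.0, x);
    }

    // Same logic as the pseudocode above
    static double neuron(double[] inputs, double[] weights, double bias) {
        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return relu(sum);
    }

    public static void main(String[] args) {
        double[] inputs  = {0.9, 0.2};   // made-up feature values
        double[] weights = {0.8, -0.5};  // made-up weights
        double bias = 0.1;
        // sum = 0.1 + 0.9*0.8 + 0.2*(-0.5) = 0.72, and relu(0.72) = 0.72
        System.out.printf("output = %.2f%n", neuron(inputs, weights, bias));
    }
}
```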
Key Terms
| Term | Developer Analogy | Description |
|---|---|---|
| Weight | Config variable | A number representing how important each input is. Learned! |
| Bias | Default offset | The baseline value a neuron has even when all inputs are 0 |
| Activation function | Like an if-statement | Controls the output range. e.g., ReLU (negatives → 0), tanh (−1 to 1) |
| Parameter | Tuning dial | Collective term for weights + biases. What learning adjusts |
A single neuron's core operations are just multiplication, addition, and an activation function. But connect millions of them together and something remarkable happens. That's a neural network.
2. Neural Network: Combining Neurons
Stacking simple building blocks layer by layer to capture complex patterns
What Is a Layer?
A single neuron can only make very simple decisions like "if it's light, it's a cat." For complex decisions, you need to group neurons into layers and stack them.
Developer analogy: a pipeline
A neural network is like a data processing pipeline.
Input data → Layer 1 → Layer 2 → Layer 3 → Final output
Like a Spring Filter Chain, each layer transforms the data and passes it to the next.
The difference is that each filter's rules aren't hardcoded — they're **determined by learning**.
3 Types of Layers
| Layer | Role | Analogy |
|---|---|---|
| Input layer | Receives raw data | API Request Body |
| Hidden layer | Extracts/transforms patterns | Internal service logic (can stack multiple) |
| Output layer | Returns the final prediction | API Response Body |
Why Stack "Deep"? (What "Deep" Means)
The "deep" in "Deep Learning" means multiple hidden layers.
More layers = capturing more complex patterns:
- Layer 1: Very simple features ("this area is bright", "there's a straight line here")
- Layer 2: Slightly more complex features ("this is a round shape", "this is striped")
- Layer 3: High-level features ("this is an eye", "this is an ear")
- Layer 4: Final decision ("this is a cat")
Each layer combines the previous layer's results to build increasingly abstract concepts.
Parameter Count = Neurons × Connections
Let's do a quick calculation:
```
// A very small neural network
Input layer:  3 neurons
Hidden layer: 4 neurons
Output layer: 2 neurons

// Parameter count
Input → Hidden:  3 × 4 = 12 weights + 4 biases = 16
Hidden → Output: 4 × 2 =  8 weights + 2 biases = 10
────────────────────────
Total parameters: 26

// For comparison, GPT-2 (nanochat d26): ~568 million
```
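The calculation above generalizes to any layer sizes. `countParams` is a hypothetical helper, but the formula (weights plus biases per layer transition) is exactly the one used above:

```java
public class ParamCount {
    // Parameters in a fully connected network, given the size of each layer
    static int countParams(int[] layerSizes) {
        int total = 0;
        for (int i = 0; i < layerSizes.length - 1; i++) {
            // weights: every neuron in layer i connects to every neuron in layer i+1,
            // plus one bias per neuron in layer i+1
            total += layerSizes[i] * layerSizes[i + 1] + layerSizes[i + 1];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countParams(new int[]{3, 4, 2})); // prints 26
    }
}
```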
3. Forward Pass: From Input to Output
How data flows through a neural network
What Is a Forward Pass?
You feed input data into the neural network and pass it through layer by layer until you get the final output (prediction). This process is called a Forward Pass — literally "passing straight forward."
Developer analogy:
Just like an HTTP request flows through Controller → Service → Repository to produce a response,
data flows through Input Layer → Hidden Layers → Output Layer to produce a prediction.
Following the Numbers: A Mini Neural Network
Let's compute a Forward Pass from start to finish with a tiny neural network.
Problem setup:
Goal: A neural network that takes 2 test scores and predicts pass/fail
Structure: 2 inputs → 2 hidden neurons → 1 output
Activation function: ReLU (negatives become 0, positives stay as-is)
(The original post has an interactive widget here: change the two input scores and watch the full forward pass, from input normalization (x₁ = 0.80, x₂ = 0.70) through the hidden layer to the output layer.)
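As a stand-in for the interactive demo, here's a minimal sketch of the 2 → 2 → 1 forward pass. The weight values are made up (the text below notes they're invented numbers like 0.3 and −0.1), and the inputs are the normalized scores x₁ = 0.80, x₂ = 0.70:

```java
public class ForwardPass {
    static double relu(double x) { return Math.max(0.0, x); }

    // 2 inputs → 2 hidden neurons (ReLU) → 1 output; all weights are made up
    static double predict(double x1, double x2) {
        // Hidden layer: each neuron has 2 weights and a bias
        double h1 = relu(0.3 * x1 + (-0.1) * x2 + 0.1);  // relu(0.27) = 0.27
        double h2 = relu(0.5 * x1 + 0.2 * x2 + (-0.2));  // relu(0.34) = 0.34
        // Output layer: 1 neuron combining the hidden outputs
        return 0.7 * h1 + 0.4 * h2 + 0.05;
    }

    public static void main(String[] args) {
        System.out.printf("prediction = %.3f%n", predict(0.80, 0.70)); // ≈ 0.375
    }
}
```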
A Forward Pass is pure math — a chain of multiplications and additions. Input numbers get transformed as they pass through layers, producing the final prediction.
But the weights in the example above (0.3, −0.1, etc.) are values I made up. Adjusting these weights to the "correct values" is exactly what learning is. And the core algorithm behind learning is Backpropagation.
4. Loss: How Wrong Are We?
Before understanding Backpropagation, we need to measure "how wrong"
Why Do We Need Loss?
The neural network made a prediction. Now we need to measure "how accurate it was" as a number. This number is called Loss.
Developer analogy:
In test code, when assertEquals(expected, actual) fails, you get an error message.
Loss is that "how different are they" expressed as a single number.
Loss = 0 → Perfect match (test passed)
Loss = large number → Way off (test failed)
The Simplest Loss: Squared Difference
Loss = (prediction − answer)²
(Squared Error)
Why square instead of just subtracting?
| Prediction | Answer | Difference (pred − answer) | Squared (Loss) |
|---|---|---|---|
| 0.8 | 1.0 | −0.2 | 0.04 |
| 0.3 | 1.0 | −0.7 | 0.49 |
| 1.2 | 1.0 | +0.2 | 0.04 |
| −0.5 | 1.0 | −1.5 | 2.25 |
Squaring solves two things:
- ① Always positive regardless of direction (+/−)
- ② The penalty grows dramatically for larger errors. A 0.2 difference gives Loss 0.04, but a 1.5 difference gives Loss 2.25 — it scales up fast.
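The loss formula in code, checked against two rows of the table:

```java
public class LossDemo {
    // Loss = (prediction − answer)²
    static double squaredError(double prediction, double target) {
        double diff = prediction - target;
        return diff * diff; // always >= 0, and grows quadratically with the error
    }

    public static void main(String[] args) {
        System.out.printf("small miss: %.2f%n", squaredError(0.8, 1.0));  // ≈ 0.04
        System.out.printf("big miss:   %.2f%n", squaredError(-0.5, 1.0)); // 2.25
    }
}
```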
The Key Point: The Sole Goal of Learning Is to Minimize Loss
Loss decreasing = predictions getting closer to the answer = the model is learning.
The val_bpb (validation bits per byte) monitored in nanochat is a type of Loss.
Alright, now we're ready.
Here's what we know so far:
- Forward Pass computes the prediction
- Loss measures "how wrong it is"
- The remaining question: "How should we adjust each weight to reduce the Loss?"
The algorithm that answers this question is Backpropagation.
5. Backpropagation: Propagating the Error Backward
The algorithm that figures out "which direction to adjust each weight" to reduce Loss
THE QUESTION BACKPROPAGATION ANSWERS
If we change weight w by a tiny amount, how much does Loss change, and in which direction?
Once we know this, we can adjust the weight in the direction that reduces Loss. This "how much and in which direction" is called the gradient.
Analogy: Walking Down a Mountain Blindfolded
Imagine this. You're standing on a mountain with your eyes closed. You need to get to the lowest point (Loss = 0).
Strategy for descending the mountain = GRADIENT DESCENT
1. Feel the slope with your feet → "This way is downhill" (= compute gradient)
2. Take a step downhill (= update weights)
3. Feel the slope again, take another step (= repeat)
4. Nowhere lower to go → You've arrived! (Loss minimized)
"Feeling the slope with your feet" = Backpropagation
"Walking downhill" = Gradient Descent
What Is a Gradient?
Mathematically, a gradient is a "derivative," but we can understand it intuitively: for each weight, it tells us both the direction (does increasing the weight raise or lower the Loss?) and the magnitude (by how much?).
Chain Rule: The Core Principle of Backpropagation
In a real neural network, weight w passes through multiple operations before affecting the final Loss. It's like dominoes.
w → (w × x) → (sum) → (activation) → (next layer) → ... → Loss
To find "how much Loss changes when w changes," you multiply the effect of each step in this domino chain on the next step. That's the Chain Rule.
Developer analogy: Error tracing
An error occurred in production (= Loss is high)
You trace the stack trace to find the root cause.
Error in server logs (output)
← caused by: Service Layer (hidden layer)
← caused by: wrong DB query (weight)
Backpropagation works the same way. Starting from the Loss (error), it traces backward (Back) layer by layer to figure out "who's most responsible (gradient)."
That's why it's called Back-propagation.
A Tiny Backpropagation
Let's walk through the entire process with numbers using a very simple example.
Problem setup:
1 neuron with input x = 2.
Prediction = w * x + b, target = 10, Loss = (prediction − target)²
Things start getting a bit harder from here... but let's push through!
Here's how I understand the Chain Rule:
[Goal] We want to reduce Loss
↑
[Question] How much does Loss change if we nudge w? = dLoss/dw
↑
[Decomposition] w doesn't change Loss directly — it goes through a chain:
Nudging w → changes prediction → changed prediction → changes Loss
```
dLoss       dLoss        dPrediction
─────  =  ───────────  ×  ─────────
  dw      dPrediction         dw
```
What I ultimately want is dLoss/dw, but since it's hard to compute directly,
I compute the two intermediate pieces first and multiply them.
- dLoss/dPrediction
- dPrediction/dw
So to get dLoss/dw, we need dLoss/dPrediction and dPrediction/dw. Let's compute each one.
The scenario we'll use:
w = 3.75, b = 3.50
1) Forward pass
Prediction = 3.75 * 2 + 3.50 = 11.00
2) Loss calculation
Loss = (11.00 − 10.00)² = 1.00
Intermediate Step 1: dLoss/dPrediction
dLoss/dPrediction = how much does Loss change when the prediction value changes?
Let's experiment by nudging the prediction slightly. Current prediction is 11.00, target is 10.
Experiment 1: Nudge by 0.05
- Original prediction: 11.00 → Loss = (11.00 − 10)² = 1.0000
- New prediction: 11.05 → Loss = (11.05 − 10)² = 1.1025
- Change in Loss: 0.1025
- Rate of change: 0.1025 / 0.05 = 2.05
Experiment 2: Nudge by a smaller amount, 0.001
- Original prediction: 11.00 → Loss = 1.000000
- New prediction: 11.001 → Loss = (11.001 − 10)² = 1.002001
- Change in Loss: 0.002001
- Rate of change: 0.002001 / 0.001 = 2.001
Notice how the rate of change converges to 2.0 as the interval gets smaller?
This "rate of change when you shrink the interval to its limit" is the derivative, and that value is the gradient. In this case, dLoss/dPrediction = 2.0.
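This limiting process is easy to verify in code. `numericalRate` is a hypothetical helper that computes (Loss(p + h) − Loss(p)) / h for a given nudge h:

```java
public class GradientCheck {
    static double loss(double prediction, double target) {
        double d = prediction - target;
        return d * d;
    }

    // Finite-difference rate of change: (Loss(p + h) − Loss(p)) / h
    static double numericalRate(double p, double target, double h) {
        return (loss(p + h, target) - loss(p, target)) / h;
    }

    public static void main(String[] args) {
        double p = 11.0, target = 10.0;
        for (double h : new double[]{0.05, 0.001, 0.000001}) {
            System.out.printf("h=%-10.6f rate=%.6f%n", h, numericalRate(p, target, h));
        }
        // the rates approach the analytic derivative 2 * (p − target) = 2.0 as h shrinks
    }
}
```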
Intermediate Step 2: dPrediction/dw
dPrediction/dw = how much does the prediction change when weight w changes?
This one is much simpler. Since prediction = w × 2 + b, let's nudge w.
- w = 3.75 → prediction = 3.75 × 2 + 3.50 = 11.00
- w = 3.76 → prediction = 3.76 × 2 + 3.50 = 11.02
- Change in prediction: 0.02
- Rate of change: 0.02 / 0.01 = 2.0
Nudging w by 0.01 changed the prediction by 0.02. No matter how much you nudge w, the ratio is always 2.0. That's because in prediction = w × 2 + b, the coefficient on w is x = 2.
In other words, dPrediction/dw = x = 2.0
Final Calculation: dLoss/dw
Now let's combine the two pieces.
```
dLoss       dLoss        dPrediction
─────  =  ───────────  ×  ─────────
  dw      dPrediction         dw
```
dLoss/dw = 2.0 × 2.0 = 4.0
This means: near the current values, increasing w raises the Loss at a rate of 4 per unit, and decreasing w lowers it at the same rate.
Rather than taking a big jump, we take a small step: scale the gradient by a small factor, here 0.01, called the learning rate (lr).
So the new weight (w) is:
- New w = old w − lr(0.01) × dLoss/dw
- New w = 3.75 − 0.01 × 4.0
- New w = 3.71
One More Calculation: Bias b
We're not done yet. We need to compute the same thing for bias b.
Same approach as dPrediction/dw. Since prediction = w × 2 + b, the coefficient on b is 1:
- Nudge b by 0.01 → prediction changes by exactly 0.01
- Rate of change: always 1.0
- dPrediction/db = 1.0
And dLoss/db = dLoss/dPrediction (reusing the value from earlier) × dPrediction/db = 2.0 × 1.0 = 2.0
So the new b is:
- New b = old b − lr(0.01) × dLoss/db
- New b = 3.50 − 0.01 × 2.0
- New b = 3.48
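The entire worked example, forward pass through parameter update, fits in a few lines. `step` is a hypothetical helper that performs one training step and returns the updated (w, b):

```java
public class BackpropStep {
    // One training step for prediction = w*x + b with squared-error loss
    static double[] step(double w, double b, double x, double target, double lr) {
        double prediction = w * x + b;                  // forward pass: 11.00
        double dLoss_dPred = 2 * (prediction - target); // dLoss/dPrediction = 2.0
        double dLoss_dw = dLoss_dPred * x;              // chain rule: × dPrediction/dw = 4.0
        double dLoss_db = dLoss_dPred * 1.0;            // chain rule: × dPrediction/db = 2.0
        // update: step against each gradient, scaled by the learning rate
        return new double[]{ w - lr * dLoss_dw, b - lr * dLoss_db };
    }

    public static void main(String[] args) {
        double[] updated = step(3.75, 3.50, 2.0, 10.0, 0.01);
        System.out.printf("new w = %.2f, new b = %.2f%n", updated[0], updated[1]); // 3.71, 3.48
    }
}
```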
Backpropagation Summary
Why do we compute the gradient for w and b separately?
- Because they affect the prediction (Forward Pass) to different degrees.
- w gets multiplied by input x before being added to the prediction, while b is added directly.
- In other words, w's influence on the prediction scales with x, while b always contributes exactly 1.
What's the benefit of using the Chain Rule?
- It breaks down something hard to compute all at once into easy pieces.
- Computing dLoss/dw directly is complex, but computing dLoss/dPrediction and dPrediction/dw separately then multiplying is straightforward.
- The advantage becomes clear at neural network scale:
- Even with 26 layers, each layer's gradient is just a simple multiplication.
- Multiply layer by layer from back to front, and you get the full gradient.
The Core Idea of Backpropagation
- Applying the Chain Rule "from back to front"
- Computational efficiency: starting from the output and going backward lets you reuse intermediate gradients.
  - Example: compute dLoss/dPrediction once, reuse it for both dLoss/dw and dLoss/db
  - Going front to back instead? You'd have to recompute from scratch for every parameter.
  - With 500 million parameters, the difference in computation is enormous.
6. The Training Loop
Forward Pass + Loss + Backpropagation together = "learning"
Training Loop = It's All Repetition
Putting everything we've learned together, neural network training is a surprisingly simple loop:
1. Forward Pass
: Feed data in and compute the prediction
2. Loss Calculation
: Measure how far the prediction is from the answer
3. Backward Pass (Backpropagation)
: Trace back from the Loss to compute each parameter's gradient
4. Parameter Update
: Nudge each parameter in the opposite direction of its gradient
5. Go back to step 1 and repeat
: Repeat tens of thousands to billions of times, and Loss gradually decreases
In pseudocode:
```
for (step = 0; step < 1_000_000; step++) {
    // 1. Forward: predict
    prediction = model.forward(input_data);

    // 2. Loss: how wrong?
    loss = (prediction - target) ** 2;

    // 3. Backward: compute each parameter's gradient
    gradients = backpropagate(loss);

    // 4. Update: adjust in the opposite direction of the gradient
    for each param in model.parameters:
        param -= learning_rate * gradients[param];
}
```
Getting a Sense of Real Scale
| | Pseudocode above | nanochat GPT-2 |
|---|---|---|
| Parameters | A few dozen | ~568 million |
| Training data | A handful | ~4.5 billion tokens (web text) |
| Iterations | Thousands | Tens of thousands of steps (each step = 500K tokens) |
| Time | 1 second | ~1.65 hours (8×H100 GPUs) |
| Cost | $0 | ~$48 |
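To make the pseudocode loop concrete, here's a sketch that trains the single-neuron example from section 5 (prediction = w·x + b, target 10). `train` is a hypothetical helper that runs the loop and returns the final loss:

```java
public class TrainingLoop {
    // Train prediction = w*x + b toward the target; returns the final loss
    static double train(int steps, double lr) {
        double x = 2.0, target = 10.0;
        double w = 3.75, b = 3.50; // same starting values as section 5
        double loss = 0;
        for (int step = 0; step < steps; step++) {
            double prediction = w * x + b;                  // 1. Forward: predict
            loss = Math.pow(prediction - target, 2);        // 2. Loss: how wrong?
            double dLoss_dPred = 2 * (prediction - target); // 3. Backward: gradients
            double dLoss_dw = dLoss_dPred * x;
            double dLoss_db = dLoss_dPred;
            w -= lr * dLoss_dw;                             // 4. Update: step against
            b -= lr * dLoss_db;                             //    the gradient
        }
        return loss;
    }

    public static void main(String[] args) {
        // loss shrinks toward 0 as w*x + b approaches the target
        System.out.printf("loss after 1000 steps: %.8f%n", train(1000, 0.01));
    }
}
```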
7. Connecting to LLMs and nanochat
How everything we've learned applies to real LLMs
An LLM = A Massive Neural Network Optimized for "Next Word Prediction"
Every concept we've covered applies directly to LLMs. The only thing that changes is "what it's predicting."
LLM training loop:
1. Forward: "The sky is" → pass through neural network → predict probability distribution for next word
2. Loss: Compare with the correct answer "blue." If "blue" had low probability, Loss is high
3. Backward: Compute gradients for all 568 million parameters
4. Update: Adjust parameters so "blue" gets higher probability
5. Repeat: Do this for 4.5 billion tokens
Do this to completion → you get a neural network that "understands English."
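Step 2's "low probability for the right word → high Loss" is cross-entropy in action. Here's a toy sketch with an invented three-word vocabulary and invented scores; real models score tens of thousands of tokens:

```java
public class NextTokenLoss {
    // Softmax: turns raw scores (logits) into probabilities that sum to 1
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double sum = 0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(logits[i] - max); // subtract max for numerical stability
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    public static void main(String[] args) {
        String[] vocab = {"blue", "green", "falling"}; // invented mini vocabulary
        double[] logits = {2.0, 0.5, -1.0};            // invented scores for "The sky is ..."
        double[] probs = softmax(logits);

        int answer = 0; // correct next word: "blue"
        double loss = -Math.log(probs[answer]); // cross-entropy: small when "blue" is likely
        System.out.printf("p(%s) = %.3f, loss = %.3f%n", vocab[answer], probs[answer], loss);
    }
}
```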
Mapping to nanochat Code
| Concept | nanochat file | Role |
|---|---|---|
| Network architecture | gpt.py | Transformer model definition (neurons, layer structure) |
| Forward Pass | gpt.py's forward() | Input tokens → next token probabilities |
| Loss calculation | base_train.py | Cross-entropy loss (for classification) |
| Backpropagation | PyTorch's .backward() | Automatic gradient computation (autograd) |
| Parameter update | optim.py | AdamW + Muon optimizer |
| Training data | dataloader.py | ClimbMix data loading |
| Inference (generation) | engine.py | Generate tokens one at a time with the trained model |
I'm planning to build a model myself, hands-on, one piece at a time, using Andrej Karpathy's nanochat modules directly.
Going forward, I'll be analyzing nanochat's code one piece at a time, running it myself, and studying whatever additional concepts come up along the way. I'll share everything through this blog.
