Jinyoung (Dev)

TIL-06: Deep Learning 02


This post is a summary of 'The spelled-out intro to neural networks and backpropagation: building micrograd', one of Andrej Karpathy's neural network lecture videos.


micrograd

micrograd is a library Karpathy released years ago — a tiny, intuitive scalar-valued autograd engine.

It implements backpropagation, the core algorithm of neural networks, to compute the gradients of a loss function with respect to the network's weights.

Unlike modern deep learning libraries like PyTorch and JAX that work with multi-dimensional arrays called "tensors," micrograd operates at the level of individual scalar (single number) values like -4 or 2. Karpathy says that "implementing backpropagation at the scalar level rather than the tensor level is more helpful for understanding the core principles of deep learning." In other words, he decided that learning neural networks with complex tensors from the start isn't educationally useful.

By stripping away complex tensor operations and building a neural network at the most fundamental level through this library, even non-specialists like me can fully understand the underlying principles of how backpropagation and the Chain Rule work.

micrograd is all you need to train neural networks, and everything else is just efficiency. - Andrej Karpathy


Derivatives: Computing the Rate of Change

Let's start with a very simple example.

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

At some point x we care about, we nudge the input by a tiny amount (h) and ask how much the output responds. In other words, the slope at a specific point represents the sensitivity to change at that point.

Now let's explain this with Python code.

import numpy as np
import matplotlib.pyplot as plt

def f(x):
  return 3*x**2 - 4*x + 5

xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)

We define a function f. Then we assign values from -5 up to (but not including) 5, in steps of 0.25, to xs, and assign f(xs) to ys.

Plotting xs and ys gives us the following graph:

example_1

Now let's approximate the slope at x = 3 using a small step h = 0.001.

h = 0.001
x = 3.0
f(x) # 20.0
f(x + h) # 20.014003000000002
f(x + h) - f(x) # 0.01400300000000243

What happens when we nudge x slightly in the positive direction? f(x) is 20.0. Adding a change of h gives us 20.014003000000002. The amount the function responded is the difference between these values (0.01400300000000243).

(f(x + h) - f(x)) / h # 14.00300000000243

We can get an approximation of the slope through this expression. The closer h gets to 0 (instead of 0.001), the more precisely this value converges to exactly 14.

So the slope turns out to be 14.

Putting this into words: at the point where x is 3, when we increase x by a tiny amount (h), f(x) increases by 14 times h.
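We can cross-check this analytically: the derivative of f(x) = 3x² - 4x + 5 is f'(x) = 6x - 4, which is exactly 14 at x = 3. A small sketch (my own loop, not from the lecture) shows the numerical estimate converging as h shrinks:

```python
# The analytic derivative of f(x) = 3x^2 - 4x + 5 is 6x - 4, so f'(3) = 14.
def f(x):
  return 3*x**2 - 4*x + 5

x = 3.0
for h in (0.1, 0.001, 0.00001):
  slope = (f(x + h) - f(x)) / h
  print(h, slope)  # the estimate approaches 14.0 as h shrinks
```

For this particular quadratic, the finite-difference error is exactly 3h, which is why h = 0.001 gave us 14.003 above.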

Now let's look at a more complex example.

a = 2.0
b = -3.0
c = 10.0
d = a*b + c

This is a function of three scalar inputs. a, b, and c are specific values representing three inputs in this expression. And there's one output, d.

What we want to do here is look at the derivative of d with respect to a, b, and c.

To intuitively understand what the derivative tells us, Karpathy uses a bit of a shortcut here.

h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10.0

d1 = a*b + c
a += h # increase a by h
d2 = a*b + c # d value when a is increased by h

print('d1', d1) # 4
print('d2', d2)
print('slope', (d2 - d1) / h)

The key here is d2. Thinking intuitively — will it be larger or smaller than d1 (4)? This tells us the sign of the derivative. Since we increased a by h and multiplied it by the negative value b (-3), d2 has to be smaller than d1 (4).

  • d1: 4.0
  • d2: 3.999699999999999

It changed from d1 (4.0) to d2 (3.999699999999999). So d2 got smaller than d1. What does that make the slope?

  • d2 - d1: how much the function responded when we nudged the value up slightly
  • (d2 - d1) / h: the slope
    • -3.000000000010772
    • So the slope is -3.
  • In other words, when we increase a by h, the function d changes by -3 times h. (The sensitivity is -3.)

Now that we've looked at the derivative of d with respect to a, let's look at the derivatives with respect to b and c.

# re-initialize the inputs (a still carries the +h from before)
a = 2.0
b = -3.0
c = 10.0

d1 = a*b + c
b += h
d2 = a*b + c

print('d1', d1) # 4
print('d2', d2)
print('slope', (d2 - d1) / h)
  • d1: 4.0
  • d2: 4.0002
  • (d2 - d1) / h: 2.0000000000042206
    • So the slope is 2.
    • When we increase b by h, the function d changes by 2 times h. (The sensitivity is 2.)
# re-initialize the inputs again
a = 2.0
b = -3.0
c = 10.0

d1 = a*b + c
c += h
d2 = a*b + c

print('d1', d1) # 4
print('d2', d2)
print('slope', (d2 - d1) / h)
  • d1: 4.0
  • d2: 4.0001
  • (d2 - d1) / h: 0.9999999999976694
  • The slope is 1.
    • Intuitively, since c is added to d, increasing c by h should increase d by h as well. So the slope is 1.
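All three hand-computed slopes can be verified in one go with the same finite-difference trick. The helper function d below is my own naming for this check, not from the lecture:

```python
# Numerical check of the three slopes of d = a*b + c at (2, -3, 10).
h = 0.0001
a, b, c = 2.0, -3.0, 10.0

def d(a, b, c):
  return a*b + c

print((d(a + h, b, c) - d(a, b, c)) / h)  # ~ -3  (dd/da = b)
print((d(a, b + h, c) - d(a, b, c)) / h)  # ~  2  (dd/db = a)
print((d(a, b, c + h) - d(a, b, c)) / h)  # ~  1  (dd/dc = 1)
```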

micrograd: The Value Object

Karpathy created the Value object to implement these kinds of operations in Python code. This allows us to track the history of computations.

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label

  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out

  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out

Value is a wrapper class for numeric (float) values — it wraps a single scalar (number) value like -4 or 2 inside its data property.

What the Value Object Does (Building a Computation Graph)

  • data: the current numeric value (e.g., -6.0)
  • grad: the derivative. Initialized to 0.0.
  • _children: the parent numbers used to create this number (e.g., a, b). Acts as a kind of pointer.
  • _op: the operator used to create this number (e.g., '*')
  • label: the name of this node. Used for debugging.
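To see this bookkeeping in action, here is a small usage sketch (repeating the class exactly as defined above, so the snippet runs on its own):

```python
# Reprise of the Value class from above, to show what one operation records.
class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label

  def __add__(self, other):
    return Value(self.data + other.data, (self, other), '+')

  def __mul__(self, other):
    return Value(self.data * other.data, (self, other), '*')

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
e = a * b

print(e.data)                            # -6.0
print(e._op)                             # *
print(sorted(v.label for v in e._prev))  # ['a', 'b']
```

Unlike a plain float, e remembers both the operation that produced it and pointers to its two operands.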

Why We Create a Separate Value for Computations

1. The Limitation of Plain Numbers (Information Loss)

When you compute a = 2.0; b = -3.0; d = a * b in Python, d only stores the result -6.0. Once this computation is done, the fact that -6 was created by multiplying a and b is forgotten.

This is a critical problem for implementing backpropagation, the core concept of deep learning. Even if you wanted to find the derivatives of a and b, you can't — because that connection has been severed.

2. Why We Use Value

By recording "who gave birth to me and through what operation" with every computation, we can trace back (Backward) one step at a time after the entire computation is finished.

  1. Start from the output value,
  2. Look at the recorded operations (+, *, etc.),
  3. Tell the parents: "My gradient is this much, so your gradients are about this much." (Using the Chain Rule.)

As a result, the user just writes expressions like d = a * b + c as usual, and the Value object automatically builds a massive computational graph behind the scenes. Then with the d.backward() command at the end, the derivatives of all input values are computed automatically.
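To make those three steps concrete before the real implementation arrives, here is a toy backward walk over plain (value, children, op) tuples. This is my own sketch of the idea, not micrograd's actual backward():

```python
# Toy backward pass: walk the recorded graph from the output and apply
# the chain rule at each operation, accumulating gradients in a dict.
def backward(node, grad, grads):
  value, children, op = node
  grads[id(node)] = grads.get(id(node), 0.0) + grad
  if op == '+':    # addition passes the gradient straight through
    for ch in children:
      backward(ch, grad, grads)
  elif op == '*':  # multiplication scales the gradient by the other input
    l, r = children
    backward(l, grad * r[0], grads)
    backward(r, grad * l[0], grads)

# The same expression as in the post: L = (a*b + c) * f
a = (2.0, (), '')
b = (-3.0, (), '')
e = (a[0]*b[0], (a, b), '*')
c = (10.0, (), '')
d = (e[0]+c[0], (e, c), '+')
f = (-2.0, (), '')
L = (d[0]*f[0], (d, f), '*')

grads = {}
backward(L, 1.0, grads)
print(grads[id(a)], grads[id(b)], grads[id(c)])  # 6.0 -4.0 -2.0
```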

Graph Visualization

Karpathy implemented a function using the graphviz library to visualize the relationships between these Value objects. Visualizing the computation graph looks like this:

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

Visualized graph:

example_2

In this graph, each node represents a Value object, and the arrows represent the flow of operations. For example, we can see that the e node was created by multiplying a and b. This is a visualization of the forward pass — how multiple input values (a, b, c, f) flow through operations to produce a single output value (L).

Next, what we need to do is run backpropagation, where we start from the end (L) and go backward, computing the gradient for every intermediate value.


Backpropagation

example_2

The derivative of L with respect to itself is simply 1.

We need to find: dL/df, dL/dd, dL/dc, dL/de, dL/db, and dL/da.

In neural networks, finding the derivative of L with respect to each node is critical. This tells us how each weight affects the final output (L).

Nodes can be broadly classified into two types:

  1. Data nodes: Fixed values given as external data that can't be arbitrarily changed. So even though we compute their gradients, we don't actually use them.
  2. Weight nodes: Weights and biases are "values we control and can change freely." We update their values according to the gradient to improve the final output.

In our expression, if we designate a and b as data nodes (inputs) among a, b, c, and f, then c and f become weight nodes. However, during backpropagation, we still need to compute the derivatives for all intermediate nodes (d, e) as well.

Backpropagation Calculations

Now let's understand how backpropagation works by manually computing the derivative of L with respect to each node. To do this, we'll add a new property called grad to each node and store the derivative in it.

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
      self.data = data
      self.grad = 0.0
      self._prev = set(_children)
      self._op = _op
      self.label = label
  • We add grad, initialized to 0.0.

Backpropagation starts from L and goes backward, computing the gradient for every intermediate value. So the very first step is finding the derivative of L with respect to L. Let's write a function lol to compute the derivative for each node.

def lol():

  h = 0.001

  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label = 'e'
  d = e + c; d.label = 'd'
  f = Value(-2.0, label='f')
  L = d * f; L.label = 'L'
  L1 = L.data

  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label = 'e'
  d = e + c; d.label = 'd'
  f = Value(-2.0, label='f')
  L = d * f; L.label = 'L'
  L2 = L.data + h

  print((L2 - L1) / h)

lol() # 1.000000000000334
  • The derivative of L with respect to L is 1.
  • We initialize the grad property of L to 1 with L.grad = 1.0.
example_3

So what are the next nodes we need to find the derivatives for? The parent nodes of L: d and f.

Derivative of L with Respect to d

# L = d * f
dL/dd = ((d+h)*f - d*f) / h
      = (d*f + h*f - d*f) / h
      = h*f / h
      = f
  • dL/dd = f
  • d.grad = -2.0

Derivative of L with Respect to f

# L = d * f
dL/df = ((f+h)*d - d*f) / h
      = (f*d + h*d - d*f) / h
      = h*d / h
      = d
  • dL/df = d
  • f.grad = 4.0
example_4

Starting from L and going backward, we computed the derivatives of d and f.
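Both results can be confirmed with the same nudge trick, using plain floats and the values from our graph (d = 4.0, f = -2.0):

```python
# Numerical check of dL/dd = f and dL/df = d for L = d * f.
h = 0.001
d, f = 4.0, -2.0

L1 = d * f
print(((d + h) * f - L1) / h)  # ~ -2.0  -> d.grad
print((d * (f + h) - L1) / h)  # ~  4.0  -> f.grad
```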

Derivative of L with Respect to c

Karpathy says we're now reaching the heart of backpropagation, and this is probably the single most important node to understand.

If you understand the gradient for this node, you understand everything. Basically all of backpropagation and neural network training. - Andrej Karpathy

Now let's derive dL / dc. The manual backpropagation calculation is the same.

Before computing dL/dc directly, let's think about what we already know:

  • The derivative of L with respect to d: how sensitive L is to d
  • How much c affects d
example_5

Intuitively, if we know how c affects d and how d affects L, we can somehow combine that information to figure out how much c affects L.

We already know the derivative of L with respect to d — we just need to compute the derivative of d with respect to c!

Derivative of d with Respect to c

# d = e + c
dd/dc = (((c+h) + e) - (c + e)) / h
      = (c + h + e - c - e) / h
      = h / h
      = 1.0

At this point, we can apply the Chain Rule — one of the most important core ideas in deep learning.

The Chain Rule

The chain rule is, simply put, the cascading of ratios.

We already know these two facts:

  1. When d changes, L changes by -2x (dL/dd = -2.0)
  2. When c changes, d changes by 1x (dd/dc = 1.0)

So the answer to "when c changes, how much does L ultimately change?" is dead simple. Just multiply these two ratios. dL/dc = dL/dd * dd/dc = -2.0 * 1.0 = -2.0

That's all there is to the chain rule.

There's a famous analogy for explaining the chain rule:

If a car is 2 times faster than a bicycle, and a bicycle is 4 times faster than a person, then a car is 8 times faster than a person. (by George F. Simmons)

example_7

Rather than computing how many times faster a car is than a person all at once, you compute how much faster each stage is, then multiply those values together to get the final answer. Let's apply this exact principle to our graph.

  1. A car is 2 times faster than a bicycle
    • The sensitivity of L to d is -2
  2. A bicycle is 4 times faster than a person
    • The sensitivity of d to c is 1
  3. A car is 8 times faster than a person
    • The sensitivity of L to c is -2
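The analogy maps directly onto a numerical check: nudging c inside the full expression for L moves L at exactly the product of the stage-wise rates (values taken from the running example):

```python
# Chain rule check: nudging c in L = (a*b + c) * f should move L by
# dL/dd * dd/dc = -2.0 * 1.0 = -2.0 per unit of h.
h = 0.0001
a, b, c, f = 2.0, -3.0, 10.0, -2.0

L1 = (a*b + c) * f
L2 = (a*b + (c + h)) * f

print((L2 - L1) / h)  # ~ -2.0
```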

The Addition Node

In the expression above, dd/dc is simply 1.0. This is because of the nature of addition — when one input changes by 1, the output changes by 1 as well.

So the addition node simply passes the gradient of the output straight through to the inputs. Applying this rule, dd/de is also 1.0, and therefore dL/de is also -2.0. Even when applying the chain rule, you're multiplying by 1.0, so the original value is preserved as-is.

example_6
  • You can see that dL/dd = -2.0 is passed through directly to dL/dc and dL/de.

Completing the Backpropagation

Coming back to the backpropagation calculation, let's compute the derivatives for the last remaining nodes: a and b.

I'd recommend trying this calculation yourself using the same manual approach we've been using. Once all calculations are done and every node's grad property is assigned, redrawing the graph gives us:

example_8

Alright — we've now completed backpropagation by manually computing the gradient for every node from the first to the last. All we did was traverse every node one by one and apply the chain rule.


Optimizing L and Updating Parameters

The reason we've been computing each node's gradient (grad) is ultimately to bring the final output L closer to our target value (0).

The update rule: nudge each value by step_size times its gradient.

Mathematically, the gradient points in the direction that increases the function value the fastest. So when adjusting variables using derivatives, the sign (+ or -) depends on what our goal is.

  1. When we want to increase L (our current situation):
    • Since we just need to go in the direction the gradient points (the "increasing" direction), we use addition (+=). (Gradient Ascent)
  2. When we want to decrease L (the typical deep learning scenario):
    • Since we need to go opposite to the direction the gradient points (away from "increasing"), we use subtraction (-=). (Gradient Descent)

In our current example, L is -8.0. To bring it closer to our target of 0, we need to increase the value, so we use the += approach — adding in the direction of the gradient.

step_size = 0.01

# Adjust variables in the direction that increases L (using +=)
a.data += step_size * a.grad
b.data += step_size * b.grad
c.data += step_size * c.grad
f.data += step_size * f.grad
# e and d are intermediate nodes, so we don't update them.

# Recompute after the update (forward pass)
e = a * b
d = e + c
L = d * f
print(L.data) # -7.286496 (closer to 0)

Since we adjusted all input values in the direction of the gradient, we can see that L got a bit larger.

  • Initial L: -8.0
  • L after update: -7.286496

In typical deep learning, the loss is positive and the goal is to minimize it, so you'd normally use -=. But in our case — nudging a negative result toward 0 — we've successfully performed one step of "optimization" using +=.

This is the entire process of "optimization." We know how each node's gradient affects the final result L. Using this information, we adjust the variables' values to steer L toward our target direction (0).
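As a sketch of where this is headed (my own loop, not from this part of the lecture), repeating the update a few times with the analytic gradients of L = (a*b + c) * f keeps raising L toward 0:

```python
# Gradient ascent repeated for a few steps, using the analytic
# derivatives of L = (a*b + c) * f derived earlier in this post.
a, b, c, f = 2.0, -3.0, 10.0, -2.0
step_size = 0.01

for _ in range(5):
  a_grad = b * f      # dL/da
  b_grad = a * f      # dL/db
  c_grad = f          # dL/dc
  f_grad = a*b + c    # dL/df
  a += step_size * a_grad
  b += step_size * b_grad
  c += step_size * c_grad
  f += step_size * f_grad

L = (a*b + c) * f
print(L)  # has risen from -8.0 toward 0
```

Each pass recomputes the gradients at the new values before stepping, which is exactly the forward/backward/update loop that training a real network repeats thousands of times.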


That wraps up this post. I covered roughly the first 51 minutes of the original video.

In the next post, I'll cover the second half of the video.
