Building ToyGrad - A Journey Into the Heart of Automatic Differentiation
For years, I've been working with PyTorch, building neural networks, training models, and marveling at how effortlessly `.backward()` computes gradients across complex computational graphs. It felt like magic - you define your forward pass, call a single method, and somehow PyTorch knows exactly how to propagate gradients back through every operation, every layer, every parameter.
But the curious part of me was never satisfied with just using this magic. How does it actually work? What happens when you call `.backward()`? How does PyTorch build and traverse computational graphs? These questions kept nagging at me, especially during late-night debugging sessions when I'd stare at gradient flows and wonder about the elegant mathematics happening behind the scenes.
So I decided to build my own autograd engine from scratch. Meet ToyGrad - my personal exploration into the fundamentals of automatic differentiation.
The "Why" Behind ToyGrad
Working as a Machine Learning Engineer, I've built countless models, implemented complex architectures, and optimized training loops. But there's something profoundly different between using a tool and truly understanding it. It's the difference between being a passenger and being the driver.
I wanted to understand:
- How computational graphs are built dynamically during forward passes
- How gradients flow backward through operations using the chain rule
- Why topological sorting is crucial for correct gradient computation
- How modern frameworks handle memory efficiently during backpropagation
PyTorch makes all of this invisible to the user (which is wonderful for productivity), but I needed to see the gears turning.
Starting Simple: The Special Class
The heart of ToyGrad is the `Special` class - a scalar value that tracks gradients:
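Here is a minimal sketch of what such a class looks like, based on the attributes described below; the real ToyGrad implementation may carry a few more conveniences:

```python
class Special:
    """A scalar value that remembers how it was produced, so gradients can flow back."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data                 # the scalar value
        self.grad = 0.0                  # d(output)/d(this node), filled in by backward()
        self._prev = set(_children)      # the nodes that produced this one
        self._op = _op                   # the operation that created it
        self._backward = lambda: None    # how to compute gradients for its inputs

    def __repr__(self):
        return f"Special(data={self.data}, grad={self.grad})"
```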
Each `Special` object knows:
- Its current value (`data`)
- Its gradient (`grad`)
- Which operations created it (`_prev` and `_op`)
- How to compute gradients for its inputs (`_backward`)
This is surprisingly similar to PyTorch tensors, just stripped down to the absolute essentials.
Building the Computational Graph
The magic happens when you implement operations. Take multiplication:
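As a rough sketch (assuming the `Special` class above, and written as a method on it), multiplication looks something like this, with the local derivatives being simply `other.data` and `self.data`:

```python
# Inside the Special class:
def __mul__(self, other):
    # accept plain Python numbers on the right-hand side too
    other = other if isinstance(other, Special) else Special(other)
    out = Special(self.data * other.data, (self, other), "*")

    def _backward():
        # product rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad

    out._backward = _backward
    return out
```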
Each operation creates a new node and stores exactly how to compute gradients for its inputs. The computational graph builds itself!
The Backward Pass: Topological Sorting in Action
Here's where the real magic happens. When you call `.backward()`, ToyGrad needs to:
- Find all nodes in the computational graph
- Sort them topologically (ensuring dependencies are computed first)
- Apply the chain rule by calling each node's `_backward` function
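In code, that procedure is compact. The sketch below (a depth-first topological sort followed by a reversed pass) captures the idea, though the actual ToyGrad method may differ in details:

```python
# Inside the Special class:
def backward(self):
    # 1. Collect every node reachable from this output, dependencies first.
    topo, visited = [], set()

    def build_topo(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:
                build_topo(child)
            topo.append(node)

    build_topo(self)

    # 2. Seed the output gradient, then apply the chain rule in reverse order.
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()
```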
Watching this work for the first time was genuinely exciting. You could trace exactly how gradients flowed from the loss back to every parameter.
Beyond Scalars: Building Neural Networks
Once the autograd engine worked, I built higher-level abstractions:
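For example, a fully connected layer can be built out of nothing but `Special` values and the operations above. The sketch below is illustrative (the `Neuron` and `Layer` names are mine, and it assumes the engine also provides `__add__` and a `tanh` op), not ToyGrad's exact module code:

```python
import random

class Neuron:
    def __init__(self, n_inputs):
        self.w = [Special(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Special(0.0)

    def __call__(self, x):
        # weighted sum of the inputs plus a bias, pushed through a nonlinearity
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, n_inputs, n_outputs):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]
```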
The beautiful thing is that once you have automatic differentiation working, building neural networks becomes straightforward. Each layer is just a composition of operations that the autograd engine already knows how to differentiate.
Testing on Real Problems
I tested ToyGrad on progressively complex problems:
- Simple optimization: Minimizing f(x) = (x - 3)^2 (sketched below)
- Logistic regression: Binary classification with sigmoid activation
- Multi-layer networks: Solving non-linear problems like the half-moons dataset
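The first of these fits in a few lines: plain gradient descent on a single `Special` value (assuming subtraction and exponentiation are also implemented as autograd ops):

```python
x = Special(0.0)
for step in range(50):
    loss = (x - 3) ** 2      # forward pass builds the graph
    x.grad = 0.0             # clear the gradient from the previous step
    loss.backward()          # backward pass fills in x.grad
    x.data -= 0.1 * x.grad   # gradient descent update
print(x.data)                # converges toward 3.0
```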
Here's what training a neural network looks like:
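The version below is illustrative rather than the exact notebook code - it assumes the `Layer`-style modules sketched earlier, a tiny inline dataset, and the remaining arithmetic ops (`+`, `-`, `**`) on `Special`:

```python
# toy dataset: four 2-D points with +/-1 labels
X_train = [[0.5, -1.0], [-0.3, 0.8], [1.2, 0.4], [-0.9, -0.6]]
y_train = [1.0, -1.0, 1.0, -1.0]

# a tiny two-layer network: 2 inputs -> 8 hidden units -> 1 output
hidden, output = Layer(2, 8), Layer(8, 1)
params = hidden.parameters() + output.parameters()

for epoch in range(100):
    # forward pass: predictions and a summed squared-error loss
    preds = [output(hidden(x)) for x in X_train]
    loss = sum(((p - y) ** 2 for p, y in zip(preds, y_train)), Special(0.0))

    # zero gradients, backpropagate, and take a gradient descent step
    for p in params:
        p.grad = 0.0
    loss.backward()
    for p in params:
        p.data -= 0.05 * p.grad
```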
Seeing this work -- watching the loss decrease, gradients flow correctly, and the model learn -- was incredibly satisfying. This is exactly what PyTorch does, just without the optimizations and with full visibility into every step.
What I Learned
Building ToyGrad taught me several profound lessons:
1. Automatic Differentiation is Elegant
The chain rule, implemented systematically, naturally handles arbitrarily complex computational graphs. There's no magic, just careful bookkeeping and mathematical rigor.
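A quick numerical sanity check makes the point concrete (again assuming `+`, `*`, and `**` are wired up as autograd ops):

```python
# y = (a*b + a)^2, so dy/da = 2*(a*b + a)*(b + 1) by the chain rule
a, b = Special(2.0), Special(-3.0)
y = (a * b + a) ** 2
y.backward()
print(a.grad)                                # gradient from the engine
print(2 * (2.0 * -3.0 + 2.0) * (-3.0 + 1))   # analytic value: identical (16.0)
```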
2. Computational Graphs are Dynamic
Unlike symbolic approaches, PyTorch (and ToyGrad) builds graphs during execution. This makes debugging easier and enables dynamic architectures, but requires careful memory management.
3. Topological Sorting is Crucial
You can't just compute gradients in any order. Dependencies matter, and topological sorting ensures gradients are computed correctly.
4. Abstractions Build Naturally
Once you have scalar automatic differentiation, building tensors, layers, and entire frameworks becomes a natural progression of abstractions.
The PyTorch Connection
After building ToyGrad, I have a much deeper appreciation for PyTorch's sophistication:
- Tensor operations: PyTorch doesn't just handle scalars; it efficiently differentiates through complex tensor operations
- Memory optimization: Techniques like in-place operations and gradient checkpointing
- CUDA integration: Seamless GPU acceleration for both forward and backward passes
- Advanced optimizers: Adam, RMSprop, and others with sophisticated momentum and adaptive learning rates
But the core concepts are the same. PyTorch is essentially ToyGrad scaled up with industrial-strength optimizations.
Beyond Understanding: Building Intuition
The most valuable outcome wasn't just understanding how automatic differentiation works, but developing intuition for why certain patterns emerge in deep learning:
- Why gradient clipping helps with training stability
- How different activation functions affect gradient flow
- Why certain architectures are more prone to vanishing gradients
- How batch normalization interacts with the computational graph
These insights directly improved my day-to-day work with PyTorch. When debugging training issues or designing new architectures, I now have a mental model of what's happening at the gradient level.
The Code
You can explore the full implementation at GitHub: ToyGrad. The repository includes:
- Complete autograd engine implementation
- Neural network modules and layers
- Training utilities and optimizers
- Jupyter notebook with examples and experiments
Final Thoughts
Building ToyGrad was one of those rare projects where the journey was more valuable than the destination. I didn't set out to compete with PyTorch (that would be absurd, but not impossible ;) ), but to understand the beautiful mathematics and computer science that makes modern deep learning possible.
If you've ever wondered what happens when you call `.backward()`, I encourage you to build your own autograd engine. Start simple, with scalars and basic operations. You'll be amazed at how quickly the magic becomes clear, logical, and elegant.
The next time you're training a complex transformer or debugging a tricky gradient flow issue, you'll have a deeper appreciation for the remarkable engineering that makes it all possible and the fundamental mathematics that makes it all work.