My Curated Resources for CUDA, Accelerators, & AI Inference
This is a collection of my favorite reading material & resources for learning about CUDA programming, GPUs, AI inference, & related topics. From foundational papers to modern optimization techniques & practical guides, it's meant for anyone who wants to dive deeper into the world of GPU computing & AI.
Industry & Expert Voices
Semianalysis
Deep industry analysis on AI hardware economics & roadmaps. The full subscription runs $500/year, but even the free tier is worth reading.
Visit the site
Andrej Karpathy
Needs no introduction. :)
Visit the channel
Simon Boehm’s Blog (Anthropic Performance Team)
Expert-level posts on CUDA optimization, GEMM, & parallelism.
Check out the blog
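For a taste of where this kind of optimization work starts, here's a naive GEMM kernel, the sort of baseline a matmul series like Boehm's builds up from before layering on tiling, shared memory, & vectorized loads. This is a minimal sketch of my own, not code from his posts:

```cuda
// Naive SGEMM: C = A * B for row-major A (MxK), B (KxN), C (MxN).
// One thread computes one output element; no tiling or shared memory yet,
// so it is heavily memory-bound. Illustrative baseline only.
__global__ void sgemm_naive(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];   // dot product along K
        C[row * N + col] = acc;
    }
}
```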
Edward Z. Yang’s Blog
Deep dives into PyTorch internals, compilers, & scalable ML systems.
Read Edward’s blog
High Yield
Deep-dive analysis & commentary on chips, GPUs, & semiconductor trends. Great for staying sharp on the hardware underpinnings of AI acceleration.
Watch on YouTube
AI News Aggregators
- smol.ai for daily AI news
GPUmode - The Biggest GPU Nerd Community
If you want to hang out where CUDA, PTX, & GPU architecture obsessives gather, GPUmode is the place. It’s part community, part knowledge base, & part rabbit hole of GPU lore.
Foundational Readings
NVIDIA Tesla: A Unified Graphics & Computing Architecture
The 2008 paper by Lindholm et al. that introduced Tesla & CUDA as a unified model for graphics + compute. A must-read for historical grounding.
Read the paper
A History of NVIDIA’s Stream Multiprocessor (Fabien Sanglard)
A beautifully written walkthrough of how NVIDIA’s SM evolved from Tesla through Turing. Great for architectural intuition.
Check out the resource
CUDA C++ Programming Guide (Official)
The canonical reference from NVIDIA. Dry but essential.
Read the guide
Practical Guides & Tools
(Mis)adventures in Running CUDA on Google Colab Free Tier (Shashank Shekhar)
A hands-on debugging journey through Colab's CUDA quirks: toolchain mismatches, PTX issues, & workarounds.
Read the blog
Compiler Explorer (godbolt.org)
An interactive playground to inspect CUDA/PTX assembly & understand what your kernels compile down to.
Visit the site
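For example, pasting a toy kernel like the one below (my own sketch) into godbolt with nvcc & flags along the lines of `-O3 -arch=sm_80` lets you watch the PTX/SASS it becomes:

```cuda
// Toy SAXPY kernel for inspecting compiler output: y = a*x + y.
// In the generated SASS, the multiply-add typically shows up as a
// single fused FFMA instruction.
__global__ void saxpy(int n, float a,
                      const float* __restrict__ x, float* __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```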
GPU Glossary (Modal)
A one-stop glossary for GPU terms, from SMs to tensor cores. Perfect for quick lookups.
Explore the glossary
Scaling & Inference
Scaling ML Models (Google Engineers’ Guide)
A comprehensive “systems thinking” book on scaling Transformers across accelerators. Covers roofline analysis, parallelism strategies, & inference trade-offs.
Explore the guide
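The roofline idea in particular is worth internalizing. Here's a back-of-envelope sketch of the core arithmetic (my own illustrative numbers, roughly A100-class, not figures from the guide):

```cuda
// Roofline back-of-envelope: a kernel is memory-bound when its arithmetic
// intensity (FLOPs per byte moved) sits below the machine's balance point.
#include <cstdio>

int main() {
    double peak_flops = 312e12;  // ~312 TFLOP/s dense BF16 (illustrative)
    double peak_bw    = 2.0e12;  // ~2 TB/s HBM bandwidth (illustrative)
    double balance    = peak_flops / peak_bw;  // FLOPs/byte to saturate compute

    // SAXPY: 2 FLOPs per element; 12 bytes moved (read x, read y, write y, fp32).
    double saxpy_ai = 2.0 / 12.0;

    printf("machine balance: %.0f FLOPs/byte\n", balance);       // ~156
    printf("saxpy intensity: %.2f FLOPs/byte (%s)\n", saxpy_ai,
           saxpy_ai < balance ? "memory-bound" : "compute-bound");
}
```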
Domain-Specific Architectures for AI Inference (Fleetwood.dev)
A thoughtful post on how inference hardware is tuned for workloads like LLMs.
Read the article
Fast LLM Inference From Scratch (Andrew Chan)
A practical deep dive into implementing efficient inference kernels.
Read the post
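A flavor of the arithmetic that drives this kind of work: KV-cache sizing. The shapes below are illustrative (roughly Llama-2-7B-like), not taken from the post:

```cuda
// KV-cache sizing: two tensors (K & V) per layer, each [seq_len, heads * head_dim].
#include <cstdio>

int main() {
    long long layers = 32, kv_heads = 32, head_dim = 128;  // 7B-ish shape (assumed)
    long long seq_len = 4096, bytes_per_elem = 2;          // fp16/bf16
    long long bytes = 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem;
    printf("KV cache per sequence: %.2f GiB\n",
           bytes / (1024.0 * 1024.0 * 1024.0));            // ~2 GiB
    // This is why batch size & context length dominate inference memory budgets.
}
```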
LLM From Scratch: Automatic Differentiation (Ben Clarkson)
A companion piece on building scalar autograd for LLMs.
Read the article
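If "scalar autograd" sounds abstract, this tiny reverse-mode sketch (mine, not Clarkson's code) captures the core idea: every operation records how to push gradients back to its inputs:

```cuda
// Minimal reverse-mode scalar autograd. Each node stores a closure that,
// given this node's gradient, accumulates gradients into its parents.
#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

struct Value {
    double data = 0.0, grad = 0.0;
    std::vector<std::shared_ptr<Value>> parents;
    std::function<void(double)> backward_fn;  // propagate grad to parents
};
using V = std::shared_ptr<Value>;

V make(double x) { auto v = std::make_shared<Value>(); v->data = x; return v; }

V add(V a, V b) {
    auto out = make(a->data + b->data);
    out->parents = {a, b};
    out->backward_fn = [a, b](double g) { a->grad += g; b->grad += g; };
    return out;
}

V mul(V a, V b) {
    auto out = make(a->data * b->data);
    out->parents = {a, b};
    out->backward_fn = [a, b](double g) {  // d(ab)/da = b, d(ab)/db = a
        a->grad += b->data * g;
        b->grad += a->data * g;
    };
    return out;
}

// Naive DFS backward pass. Fine for the tree below; a real implementation
// topologically sorts the graph so shared nodes are visited exactly once.
void backward(V root) {
    root->grad = 1.0;
    std::function<void(const V&)> visit = [&](const V& v) {
        if (v->backward_fn) v->backward_fn(v->grad);
        for (auto& p : v->parents) visit(p);
    };
    visit(root);
}

int main() {
    auto x = make(3.0), w = make(2.0), b = make(1.0);
    auto y = add(mul(w, x), b);  // y = w*x + b
    backward(y);
    printf("y=%.1f dy/dw=%.1f dy/dx=%.1f\n", y->data, w->grad, x->grad);
    // Expected: y=7.0 dy/dw=3.0 dy/dx=2.0
}
```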
Introduction to Cloud TPU (Google)
For contrast with GPUs, this doc explains TPU architecture & use cases.
Read the documentation
This list is a work in progress: a personal map of the GPU/accelerator & scaling landscape. I'll keep adding resources as I work through my reading list.