Sep 24, 2025

My Curated Resources for CUDA, Accelerators, & AI Inference

This is a collection of my favorite reading materials & resources to help with learning about CUDA programming, GPUs, AI inference, & related topics. From foundational papers to modern optimization techniques & practical guides, these resources are designed for anyone interested in diving deeper into the world of GPU computing & AI.

Industry & Expert Voices

Semianalysis
Deep industry analysis of AI hardware economics & roadmaps. The full subscription runs $500/year, but even the free tier is worth following.
Visit the site

Andrej Karpathy
Needs no introduction. :)
Visit the channel

Simon Boehm’s Blog (Anthropic Performance Team)
Expert-level posts on CUDA optimization, GEMM, & parallelism.
Check out the blog

Edward Z. Yang’s Blog
Deep dives into PyTorch internals, compilers, & scalable ML systems.
Read Edward’s blog

High Yield
Deep-dive analysis & commentary on chips, GPUs, & semiconductor trends. Great for staying sharp on the hardware underpinnings of AI acceleration.
YouTube


GPUmode - The Biggest GPU Nerd Community

If you want to hang out where CUDA, PTX, & GPU architecture obsessives gather, GPUmode is the place. It’s part community, part knowledge base, & part rabbit hole of GPU lore.


Foundational Readings

NVIDIA Tesla: A Unified Graphics & Computing Architecture
The 2008 paper by Lindholm et al. that introduced Tesla & CUDA as a unified model for graphics + compute. A must-read for historical grounding.
Read the paper

A History of NVIDIA’s Stream Multiprocessor (Fabien Sanglard)
A beautifully written walkthrough of how NVIDIA’s SM evolved from Tesla through Turing. Great for architectural intuition.
Check out the resource

CUDA C++ Programming Guide (Official)
The canonical reference from NVIDIA. Dry but essential.
Read the guide


Practical Guides & Tools

(Mis)adventures in Running CUDA on Google Colab Free Tier (Shashank Shekhar)
A hands-on debugging journey through Colab's CUDA quirks: toolchain mismatches, PTX issues, & workarounds.
Read the blog

Compiler Explorer (godbolt.org)
An interactive playground to inspect CUDA/PTX assembly & understand what your kernels compile down to.
Visit the site

GPU Glossary (Modal)
A one-stop glossary for GPU terms, from SMs to tensor cores. Perfect for quick lookups.
Explore the glossary


Scaling & Inference

Scaling ML Models (Google Engineers’ Guide)
A comprehensive “systems thinking” book on scaling Transformers across accelerators. Covers roofline analysis, parallelism strategies, & inference trade-offs.
Explore the guide
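The roofline analysis the guide covers boils down to one comparison: a kernel's arithmetic intensity (FLOPs per byte moved) versus the machine's FLOP/byte balance. A back-of-envelope sketch of that check for a GEMM (the peak numbers below are illustrative, roughly A100-class, & not taken from the guide):

```python
# Back-of-envelope roofline check: is a GEMM compute- or memory-bound?
def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k  # one multiply + one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

PEAK_FLOPS = 312e12  # ~BF16 tensor-core peak, FLOP/s (assumed, A100-class)
PEAK_BW = 2.0e12     # ~HBM bandwidth, bytes/s (assumed)
ridge = PEAK_FLOPS / PEAK_BW  # machine balance: intensity needed to be compute-bound

for m, n, k in [(8, 4096, 4096), (4096, 4096, 4096)]:
    ai = arithmetic_intensity(m, n, k)
    bound = "compute-bound" if ai > ridge else "memory-bound"
    print(f"GEMM {m}x{n}x{k}: {ai:.1f} FLOP/byte -> {bound}")
```

The skinny GEMM (batch-of-8 decode step) lands well below the ridge point & is memory-bound, while the square GEMM sits far above it; this is exactly why inference & training stress hardware so differently.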

Domain-Specific Architectures for AI Inference (Fleetwood.dev)
A thoughtful post on how inference hardware is tuned for workloads like LLMs.
Read the article

Fast LLM Inference From Scratch (Andrew Chan)
A practical deep dive into implementing efficient inference kernels.
Read the post

LLM From Scratch: Automatic Differentiation (Ben Clarkson)
A companion piece on building scalar autograd for LLMs.
Read the article
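The core mechanism behind scalar autograd, which the post builds up in detail, fits in a few lines: each operation records its parents & a closure that propagates gradients, then a reverse topological sweep applies the chain rule. A generic micrograd-style sketch (not Clarkson's code):

```python
# Minimal scalar autograd: each op records parents & a local backward rule.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():  # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x  # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Gradients accumulate with `+=` so a value used twice (like `x` above) collects contributions from every path through the graph.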

Introduction to Cloud TPU (Google)
For contrast with GPUs, this doc explains TPU architecture & use cases.
Read the documentation


This list is a work in progress, a personal map of the GPU/accelerator & scaling landscape. I'll keep adding resources as I work through my reading list.