Introduction to Automatic Differentiation
Automatic differentiation (AD) is the backbone of modern deep learning. Unlike symbolic differentiation (which manipulates expressions) or numerical differentiation (which approximates via finite differences), AD computes exact derivatives by decomposing programs into elementary operations and applying the chain rule systematically.
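To see why finite differences only approximate, here is a quick sketch; the function `f`, the point `2`, and the step size `h` are my own illustrative choices:

```typescript
// Numerical differentiation approximates f'(x) with a finite step h.
const f = (x: number) => x * x * x; // f(x) = x^3, so f'(2) = 12 exactly
const h = 1e-5;

// Central difference: (f(x + h) - f(x - h)) / 2h
const approx = (f(2 + h) - f(2 - h)) / (2 * h);

console.log(approx);                 // ≈ 12, but not exact
console.log(Math.abs(approx - 12)); // small but nonzero: truncation + round-off error
```

Shrinking `h` reduces truncation error but amplifies floating-point round-off; AD avoids this trade-off entirely by computing exact derivatives of each elementary operation.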
The Chain Rule
Given a composition of functions $y = f(g(h(x)))$, the chain rule tells us:

$$\frac{dy}{dx} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

The order in which we multiply these Jacobians gives rise to two modes of AD.
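As a concrete instance (my own example, not from the text above), take $y = \sin(x^2)$, i.e. $f = \sin$ and $g(x) = x^2$:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x) = \cos(x^2) \cdot 2x$$

Forward mode evaluates this product left-to-right alongside the computation; reverse mode evaluates it right-to-left after the fact.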
Forward Mode
Forward mode propagates derivatives alongside the computation, from inputs to outputs. For each input variable $x_i$, we track a dual number $(v, \dot{v})$ where $\dot{v} = \frac{\partial v}{\partial x_i}$.
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, forward mode computes one column of the Jacobian per pass — efficient when $n \ll m$.
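A minimal forward-mode sketch in TypeScript (a hypothetical `Dual` class, not part of TSTorch's API): each value carries its derivative with respect to one seeded input, and each operation applies the matching derivative rule.

```typescript
// A dual number: a value and its derivative w.r.t. one chosen input.
class Dual {
  constructor(public val: number, public dot: number) {}

  add(other: Dual): Dual {
    // Sum rule: (u + v)' = u' + v'
    return new Dual(this.val + other.val, this.dot + other.dot);
  }

  mul(other: Dual): Dual {
    // Product rule: (u·v)' = u'·v + u·v'
    return new Dual(
      this.val * other.val,
      this.dot * other.val + this.val * other.dot
    );
  }
}

// Differentiate f(x) = x·x + x at x = 3.
// Seed the input with dot = 1, since dx/dx = 1.
const x = new Dual(3, 1);
const y = x.mul(x).add(x);
console.log(y.val); // 12  (3·3 + 3)
console.log(y.dot); // 7   (f'(x) = 2x + 1 = 7)
```

One pass computes the derivative with respect to the single seeded input; differentiating with respect to $n$ inputs takes $n$ passes, which is why this mode suits few-input functions.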
Reverse Mode
Reverse mode — what frameworks like PyTorch call “backpropagation” — propagates derivatives from outputs to inputs. It first performs a forward pass to build a computation graph, then walks it backward:

$$\bar{v}_i = \sum_{j \in \operatorname{succ}(i)} \bar{v}_j \, \frac{\partial v_j}{\partial v_i}$$

where $\bar{v} = \frac{\partial y}{\partial v}$ is the adjoint. This computes one row of the Jacobian per pass — efficient when $m \ll n$, which is exactly the case in ML (scalar loss, many parameters).
A Minimal Value Class
Here’s the core of a reverse-mode AD engine in TypeScript:
```typescript
class Value {
  data: number;
  grad: number = 0;
  // Backward step for this node; set by the operation that produced it.
  private backward: () => void = () => {};
  private children: Value[];

  constructor(data: number, children: Value[] = []) {
    this.data = data;
    this.children = children;
  }

  add(other: Value): Value {
    const out = new Value(this.data + other.data, [this, other]);
    out.backward = () => {
      // ∂(a + b)/∂a = ∂(a + b)/∂b = 1, so the upstream grad passes through.
      this.grad += out.grad;
      other.grad += out.grad;
    };
    return out;
  }

  mul(other: Value): Value {
    const out = new Value(this.data * other.data, [this, other]);
    out.backward = () => {
      // ∂(a·b)/∂a = b and ∂(a·b)/∂b = a.
      this.grad += other.data * out.grad;
      other.grad += this.data * out.grad;
    };
    return out;
  }

  backprop(): void {
    // Topologically sort the graph so each node's backward runs only after
    // every node that consumes it has already contributed to its grad.
    const topo: Value[] = [];
    const visited = new Set<Value>();
    const build = (v: Value) => {
      if (!visited.has(v)) {
        visited.add(v);
        v.children.forEach(build);
        topo.push(v);
      }
    };
    build(this);
    this.grad = 1; // seed: ∂y/∂y = 1
    for (const v of topo.reverse()) {
      v.backward();
    }
  }
}
```
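Here is the class driven end to end on $y = w \cdot x + b$; the class body is repeated in condensed form so the snippet runs standalone, and the input values are arbitrary choices of mine:

```typescript
// Condensed copy of the Value class above, so this snippet is self-contained.
class Value {
  grad = 0;
  private backward: () => void = () => {};
  constructor(public data: number, private children: Value[] = []) {}
  add(o: Value): Value {
    const out = new Value(this.data + o.data, [this, o]);
    out.backward = () => { this.grad += out.grad; o.grad += out.grad; };
    return out;
  }
  mul(o: Value): Value {
    const out = new Value(this.data * o.data, [this, o]);
    out.backward = () => { this.grad += o.data * out.grad; o.grad += this.data * out.grad; };
    return out;
  }
  backprop(): void {
    const topo: Value[] = [];
    const seen = new Set<Value>();
    const build = (v: Value) => {
      if (seen.has(v)) return;
      seen.add(v);
      v.children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) v.backward();
  }
}

// y = w·x + b with example values w = 2, x = 3, b = 1.
const w = new Value(2);
const x = new Value(3);
const b = new Value(1);
const y = w.mul(x).add(b);
y.backprop();

console.log(y.data); // 7
console.log(w.grad); // 3  (∂y/∂w = x)
console.log(x.grad); // 2  (∂y/∂x = w)
console.log(b.grad); // 1  (∂y/∂b = 1)
```

Note that gradients accumulate with `+=`; a second call to `backprop` without zeroing `grad` first would double-count, which is why frameworks like PyTorch expose an explicit `zero_grad` step.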
Interactive Computation Graph
Try adjusting the inputs below to see how gradients flow through a simple computation $y = w \cdot x + b$:
Computation graph: y = w·x + b. Toggle gradients to see ∂y/∂(each variable).
Key Takeaways
- Forward mode is $n$ passes for $n$ inputs — good for few inputs, many outputs
- Reverse mode is $m$ passes for $m$ outputs — ideal for ML’s scalar-loss setup
- The computation graph records operations during the forward pass, then gradients propagate backward through the graph
- TSTorch implements reverse-mode AD, just like PyTorch, but targeting TypeScript and WebGPU