A Number with a Shadow

hi. let me tell you about a number trick.

You can measure a curve with a ruler. Lay it across two points, read the slope. Zoom in and the ruler shrinks — better approximation. Keep zooming and your hands start shaking. At some point you're measuring the shake, not the curve.

measuring...

That's not a metaphor. That's literally what happens when you compute a derivative with floating point arithmetic:

let dfdx = (f(x + h) - f(x)) / h;

Pick a small h. Evaluate twice. Divide. It worked. Then I made h smaller and it worked better. Then I made h way smaller and it got worse.

 

The step-size trap

Here's sin(x) at x = 1. The green line is the true tangent, slope cos(1) ≈ 0.54, exact. The red line is the secant from finite differences, connecting (x, sin x) and (x+h, sin(x+h)), with slope (sin(x+h) − sin x) / h.

Drag h down. Watch the red line approach the green one. Now keep going past 10⁻⁸. Watch it fall apart.

[interactive: sin(x) with the exact tangent (green) and the finite-difference secant (red); drag h and read off tangent, secant, and error]
the ruler is shorter than the smallest scratch it can see.

This is catastrophic cancellation. At tiny h, f(x + h) and f(x) are nearly identical in floating point. Subtract them and you're left with noise. Divide that noise by h and you amplify it. At h = 10⁻¹⁵ the secant is computing its slope from two points that are the same number in f64.
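You can watch the trap numerically in a few lines. A minimal sketch in plain Rust, using f(x) = x³ − 2x + 1 (the same function the post differentiates later), whose exact derivative at x = 3 is 25:

```rust
fn main() {
    // f'(x) = 3x² − 2, so f'(3) = 25 exactly
    let f = |x: f64| x * x * x - 2.0 * x + 1.0;
    let x = 3.0_f64;
    for &h in &[1e-4, 1e-8, 1e-12, 1e-15] {
        let secant = (f(x + h) - f(x)) / h;
        // error shrinks until cancellation takes over, then blows up
        println!("h = {h:>7.0e}   error = {:.3e}", (secant - 25.0).abs());
    }
}
```

The error falls from roughly 10⁻³ to roughly 10⁻⁷, then climbs back up as h shrinks past the sweet spot.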

There's a sweet spot around h ≈ 10⁻⁸ where truncation error and cancellation error balance out. Here's the V-shaped trap. Drag h to explore it — then switch to the pathological function and watch the sweet spot collapse:

[interactive: log–log plot of |error| against step size h, showing the truncation and cancellation branches meeting at the sweet spot (h ≈ 10⁻⁸ → error ≈ 10⁻⁸), with the flat dual-number line underneath]

The polynomial gets to ~10⁻⁸ error at the sweet spot. The pathological function barely hits 10⁻⁵. And the green dashed line at the bottom? Still exact. Still no h.

The idea has been around since 1964, when R. E. Wengert published a short CACM paper describing exactly this kind of automatic derivative evaluation.

 

 

A number with a shadow

wait. what if the derivative just... came along?

A dual number is a pair: a value and its derivative. The value is the number you know. The derivative is its shadow, tracking how the number changed to get where it is.

The arithmetic rules are the rules you already know from calculus:

Addition: (a, a') + (b, b') = (a + b, a' + b'). Derivatives add.

Multiplication: (a, a') × (b, b') = (a·b, a·b' + a'·b). That's the product rule. Literally the product rule from calc.

sin: sin(a, a') = (sin a, a' · cos a)
sqrt: √(a, a') = (√a, a' / (2√a))
Chain rule. Both of them.

Every operation carries the derivative forward using a rule from first-semester calculus. No limits. No step sizes. No approximation. The derivative just comes along for the ride.
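Before any operator overloading, you can run these rules by hand on bare (value, derivative) pairs. A sketch with hypothetical helper functions (not tang's API):

```rust
type D = (f64, f64); // (value, derivative)

fn add(a: D, b: D) -> D { (a.0 + b.0, a.1 + b.1) }            // derivatives add
fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) } // product rule

fn main() {
    let x: D = (3.0, 1.0);  // seed: value 3, ∂x/∂x = 1
    let x2 = mul(x, x);     // (9, 6): d(x²)/dx at 3 is 6
    let x3 = mul(x2, x);    // (27, 27): d(x³)/dx at 3 is 27
    let s = add(x2, x3);    // (36, 33): d(x² + x³)/dx = 2x + 3x² = 33
    println!("{:?} {:?} {:?}", x2, x3, s);
}
```

Each pair carries its exact derivative; no h anywhere.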

 

Watch it work

f(x) = x³ − 2x + 1 at x = 3. Seed x = (3, 1), value 3, derivative 1 because ∂x/∂x = 1. Click through and watch the chain rule propagate:

[interactive: step through the computation graph of x³ − 2x + 1 with x seeded as (3, 1), watching each node's (value, derivative) pair propagate]

Check: f'(x) = 3x² − 2. At x = 3: 3(9) − 2 = 25. The chain rule did all the work, one node at a time.

 

25 lines of Rust

Here's a complete dual number type:

#[derive(Clone, Copy)]
struct Dual { val: f64, dot: f64 }

impl std::ops::Add for Dual {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        Dual { val: self.val + rhs.val, dot: self.dot + rhs.dot }
    }
}

impl std::ops::Mul for Dual {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        // product rule: d(ab) = a·db + da·b
        Dual {
            val: self.val * rhs.val,
            dot: self.val * rhs.dot + self.dot * rhs.val,
        }
    }
}

// Sub, Neg: same pattern. (a,a')-(b,b') = (a-b, a'-b')

impl Dual {
    // seed an input variable: ∂x/∂x = 1
    fn var(x: f64) -> Self {
        Dual { val: x, dot: 1.0 }
    }
    fn sin(self) -> Self {
        Dual { val: self.val.sin(), dot: self.dot * self.val.cos() }
    }
    fn sqrt(self) -> Self {
        let s = self.val.sqrt();
        Dual { val: s, dot: self.dot / (2.0 * s) }
    }
}

Any function built from these operations gives you exact derivatives. No tape. No computation graph. No framework. Two floats instead of one.
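To check the worked example from earlier, here's the type driving f(x) = x³ − 2x + 1 at x = 3 end to end. The Dual type is repeated so the sketch compiles standalone, Sub is filled in as the comment above describes, and `con` is a hypothetical constructor name for constants (derivative 0):

```rust
use std::ops::{Add, Mul, Sub};

#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl Dual {
    fn var(x: f64) -> Self { Dual { val: x, dot: 1.0 } } // seed: ∂x/∂x = 1
    fn con(c: f64) -> Self { Dual { val: c, dot: 0.0 } } // constant: derivative 0
}
impl Add for Dual {
    type Output = Self;
    fn add(self, r: Self) -> Self { Dual { val: self.val + r.val, dot: self.dot + r.dot } }
}
impl Sub for Dual {
    type Output = Self;
    fn sub(self, r: Self) -> Self { Dual { val: self.val - r.val, dot: self.dot - r.dot } }
}
impl Mul for Dual {
    type Output = Self;
    // product rule: d(ab) = a·db + da·b
    fn mul(self, r: Self) -> Self {
        Dual { val: self.val * r.val, dot: self.val * r.dot + self.dot * r.val }
    }
}

fn main() {
    let x = Dual::var(3.0);
    let y = x * x * x - Dual::con(2.0) * x + Dual::con(1.0); // f(x) = x³ − 2x + 1
    println!("f(3) = {}, f'(3) = {}", y.val, y.dot);         // 22 and 25
}
```

The 25 pops out of the dot field, matching 3x² − 2 at x = 3 exactly.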

Here's tang computing x·sin(x) in your browser. The Rust code is compiled to WebAssembly and runs the same Dual type from the library. Drag the point along the curve and watch the tangent line follow, derivative exact at every position:

tang running in your browser via WebAssembly
[interactive: x·sin(x) with a draggable point; x, f(x), and f'(x) readouts update live]
let d = Dual::var(x);
let y = d * d.sin();
// y.val = f(x), y.dot = f'(x)
that's real Rust, running in your browser right now.

Type an expression and watch the dual numbers propagate through every operation. Change x. Try the presets. The derivative is always exact.

For the preset f(x) = sin(x · x) at x = 2, the trace reads:

expression      value     derivative   rule
x               2         1            seed variable
(x · x)         4         4            product rule
sin((x · x))    −0.7568   −2.6146      chain rule (sin)

f(2) = −0.7568,  f′(2) = −2.6146

This is forward-mode AD, great when you have few inputs. For millions of parameters (training neural nets), you want reverse-mode. That's backpropagation. Same mathematical insight, the chain rule applied mechanically, but propagating backward through a recorded tape. tang has both: Dual<S> for forward-mode, tang-ad for reverse-mode with grad, jacobian, hessian.
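What "few inputs" means concretely: forward mode gives you one directional derivative per pass, so a full gradient of f(x, y) takes two passes, one seed per input. A sketch on bare pairs (hypothetical helpers, not tang's API):

```rust
type D = (f64, f64); // (value, derivative along the seeded input)

fn add(a: D, b: D) -> D { (a.0 + b.0, a.1 + b.1) }
fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) } // product rule
fn sin(a: D) -> D { (a.0.sin(), a.1 * a.0.cos()) }             // chain rule

// f(x, y) = x·y + sin(x); analytically ∂f/∂x = y + cos(x), ∂f/∂y = x
fn f(x: D, y: D) -> D { add(mul(x, y), sin(x)) }

fn main() {
    let (x, y) = (2.0, 5.0);
    let dfdx = f((x, 1.0), (y, 0.0)).1; // pass 1: seed x
    let dfdy = f((x, 0.0), (y, 1.0)).1; // pass 2: seed y
    println!("∂f/∂x = {dfdx}, ∂f/∂y = {dfdy}"); // y + cos(x) ≈ 4.58, and x = 2
}
```

Two inputs, two passes. A million inputs, a million passes — which is why reverse mode exists.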

 

 

The type trick

Here's where it gets interesting.

What if your math library was generic over the number type? Not just f64, any type that implements add, multiply, sin, sqrt. A trait. Call it Scalar:

fn spring_energy<S: Scalar>(pos: Vec3<S>, rest_len: S) -> S {
    let stretch = pos.norm() - rest_len;
    stretch * stretch * S::from_f64(0.5)
}

Call with f64: plain evaluation. Call with Dual<f64>: exact gradient. Same function. Same code. The derivative is free. Just change the type parameter.
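A pared-down sketch of the trick. This Scalar is a toy with just the ops the example needs (tang's real trait has far more), and the spring is 1-D instead of Vec3 to keep it short:

```rust
use std::ops::{Add, Mul, Sub};

// Toy trait: just enough surface for the example.
trait Scalar: Copy + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> {
    fn from_f64(x: f64) -> Self;
}

impl Scalar for f64 {
    fn from_f64(x: f64) -> Self { x }
}

#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl Add for Dual {
    type Output = Self;
    fn add(self, r: Self) -> Self { Dual { val: self.val + r.val, dot: self.dot + r.dot } }
}
impl Sub for Dual {
    type Output = Self;
    fn sub(self, r: Self) -> Self { Dual { val: self.val - r.val, dot: self.dot - r.dot } }
}
impl Mul for Dual {
    type Output = Self;
    fn mul(self, r: Self) -> Self {
        Dual { val: self.val * r.val, dot: self.val * r.dot + self.dot * r.val }
    }
}
impl Scalar for Dual {
    fn from_f64(x: f64) -> Self { Dual { val: x, dot: 0.0 } }
}

// Written once, generic over the number type.
// 1-D "spring energy": E = ½(x − rest)²  →  dE/dx = x − rest
fn spring_energy<S: Scalar>(x: S, rest: S) -> S {
    let stretch = x - rest;
    stretch * stretch * S::from_f64(0.5)
}

fn main() {
    // plain evaluation
    let e = spring_energy(3.0_f64, 1.0);
    // same function, Dual in: derivative comes out
    let d = spring_energy(Dual { val: 3.0, dot: 1.0 }, Dual::from_f64(1.0));
    println!("E = {e}, E = {}, dE/dx = {}", d.val, d.dot); // 2, 2, 2
}
```

Same function body both times; only the type parameter changed.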

I built this. tang. A Rust math library where every type (Vec3, Mat4, Quat, Transform, DMat, DVec, LU, SVD, Cholesky, eigensolvers) is generic over Scalar. 18 crates. ~50,000 lines. 468 tests. no_std throughout, MIT licensed.

Write your physics once. Get exact derivatives by swapping the type.

go green go! ...wait, they both started at the same place.

What does that buy you? Watch two gradient descent particles race down Himmelblau's function. The green particle uses the exact gradient from dual numbers. The red particle uses finite differences. Same learning rate, same start. Drag the h slider to see how step size changes the FD path. Click anywhere to restart from a new position:

[interactive: gradient descent race on Himmelblau's function — green uses the exact dual-number gradient, red uses finite differences; drag h to change the FD path, click to restart from a new position]

 

 

What you get

tang's decompositions are generic over Scalar. This is the thing nalgebra can't do. Dual<f64> flows through LU, SVD, Cholesky, QR, eigensolvers. You can differentiate through a linear solve:

let a = DMat::<Dual<f64>>::from_fn(3, 3, |i, j| /* ... */);
let b = DVec::<Dual<f64>>::from_vec(vec![/* ... */]);
let x = Lu::new(a).solve(&b);
// x[i].dot = ∂solution/∂(whatever you seeded)

Derivatives of the solution, through the decomposition, exact. No adjoint method by hand. No implicit function theorem. Just change the type.
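The 1×1 case shows the mechanism. Solving a·x = b with a dual coefficient makes the quotient rule hand you ∂x/∂a = −b/a² automatically — the same thing happening dimension by dimension inside an LU solve. A standalone sketch, not tang's API:

```rust
use std::ops::Div;

#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl Div for Dual {
    type Output = Self;
    // quotient rule: d(b/a) = (b'·a − b·a') / a²
    fn div(self, r: Self) -> Self {
        Dual {
            val: self.val / r.val,
            dot: (self.dot * r.val - self.val * r.dot) / (r.val * r.val),
        }
    }
}

fn main() {
    // solve a·x = b with a seeded; analytically ∂x/∂a = −b/a²
    let a = Dual { val: 4.0, dot: 1.0 }; // seed ∂/∂a
    let b = Dual { val: 8.0, dot: 0.0 };
    let x = b / a;
    println!("x = {}, ∂x/∂a = {}", x.val, x.dot); // 2, −0.5 (= −8/16)
}
```

No implicit function theorem on paper; the derivative of the solve falls out of the arithmetic.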

nalgebra's decompositions are specialized for f64. They cannot accept Dual<f64>. That's not a criticism, it's a different design. But if you need gradients through your solver, you need the generics.

Dual-number LU solve: 81ns. Finite differences for the same derivative: 167ns. 2x faster and exact.

 

Feel the derivative

A spring. Force F = −k(x − L). Two derivatives come free from the dual numbers: ∂F/∂k tells you how the force changes if you stiffen the spring. ∂F/∂L tells you how it changes if you lengthen the rest position. Drag the sliders and watch the arrows:

[interactive: a spring with sliders for k and L; readouts for F, ∂F/∂k, and ∂F/∂L]
∂F/∂k says: if you stiffen the spring, I bounce harder ;)

These are the sensitivities an optimizer needs. Not "how does the output change if I nudge the input by some small h" but how does it change exactly, analytically, through every operation in the computation.
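In code, the two arrows are two seeded passes through the same force function. A sketch on bare pairs (hypothetical helpers, not tang's API):

```rust
type D = (f64, f64); // (value, derivative along the seeded input)

fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) }
fn sub(a: D, b: D) -> D { (a.0 - b.0, a.1 - b.1) }
fn neg(a: D) -> D { (-a.0, -a.1) }

// F = −k(x − L); analytically ∂F/∂k = −(x − L), ∂F/∂L = k
fn force(k: D, x: D, l: D) -> D { neg(mul(k, sub(x, l))) }

fn main() {
    let (k, x, l) = (3.0, 5.0, 2.0);
    let dfdk = force((k, 1.0), (x, 0.0), (l, 0.0)).1; // seed k: −(x − L) = −3
    let dfdl = force((k, 0.0), (x, 0.0), (l, 1.0)).1; // seed L: k = 3
    println!("∂F/∂k = {dfdk}, ∂F/∂L = {dfdl}");
}
```

Both sensitivities come from the one force expression, exact.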

 

 

18 crates

[interactive crate map: tang · tang-la · tang-sparse · tang-ad · tang-optim · tang-train · tang-expr · tang-gpu · tang-compute · tang-tensor · tang-mesh · tang-safetensors · tang-hub — the algebra tree on one side, the symbolic tree on the other]
Two trees converging. Left: algebra (vectors, matrices, decompositions, autodiff, optimization). Right: symbolic computation (expression DAGs, GPU compilation, n-d arrays). They meet at training and distributed compute.

tang-train has 40+ layers from Linear through Transformers with GQA, SwiGLU, RotaryEmbedding, LoRA. Hand-(claude)-written backward passes, no tape overhead. AdamW with state persistence. Learning rate schedulers. A PINN module for physics-informed neural networks.

tang-gpu traces Rust math into symbolic DAGs via tang-expr, differentiates symbolically, and compiles to Metal/Vulkan/DX12 through wgpu. tang-compute writes native Metal and CUDA kernels for the hot paths: flash attention, fused activations, simdgroup matrix acceleration.

tang-mesh ships expression graphs over QUIC, not tensors. Each worker compiles locally to its own backend. Your M3 compiles to Metal, a 4090 compiles to CUDA. Same graph, different hardware. Data-parallel, pipeline-parallel, tensor-parallel.

Honest about speed

tang's dense LA is pure generic Rust. nalgebra dispatches to BLAS. The gap is real:

Op (n=128)   tang     nalgebra
GEMM         250µs    84µs
LU solve     118µs    107µs

tang works with Dual<f64>, intervals, anything implementing Scalar. nalgebra's fast path is f64-only. Different tradeoff. Enable tang's faer feature for Apple Accelerate BLAS when you need it.

For geometry primitives, the stuff that actually runs in inner loops:

Op             tang     nalgebra   glam (f32)
vec3 dot       2.2ns    2.2ns      1.6ns
quat slerp     8.1ns    11.0ns     6.3ns
mat4 inverse   12.0ns   16.0ns     8.8ns

Competitive despite full generics. The Scalar abstraction costs almost nothing at the call site.

 

 

no more picking h.

tang is the math backbone of gaia, a 7B-parameter world model I'm training for physical design. The simulation that generates labels and the network that learns from them use the same types, the same Scalar trait, the same code. That's the whole point. When the derivative is free, the line between simulation and learning disappears.

cargo add tang tang-la tang-ad
tang.toys

docs · examples · source