A Number with a Shadow
March 02, 2026

You can measure a curve with a ruler. Lay it across two points, read the slope. Zoom in and the ruler shrinks — better approximation. Keep zooming and your hands start shaking. At some point you're measuring the shake, not the curve.
That's not a metaphor. That's literally what happens when you compute a derivative with floating point arithmetic:
```rust
let dfdx = (f(x + h) - f(x)) / h;
```

Pick a small h. Evaluate twice. Divide. It worked. Then I made h smaller and it worked better. Then I made h way smaller and it got worse.
The step-size trap
Here's sin(x) at x = 1. The green line is the true tangent, slope cos(1) ≈ 0.54, exact. The red line is the secant from finite differences, connecting (x, sin x) and (x + h, sin(x + h)), with slope (sin(x + h) − sin x)/h.
Drag h down. Watch the red line approach the green one. Now keep going past 10⁻⁸. Watch it fall apart.
This is catastrophic cancellation. At tiny h, f(x + h) and f(x) are nearly identical in floating point. Subtract them and you're left with noise. Divide that noise by h and you amplify it. At h = 10⁻¹⁵ the secant is computing its slope from two points that are the same number in f64.
There's a sweet spot around h ≈ 10⁻⁸ where truncation error and cancellation error balance out. Here's the V-shaped trap. Drag h to explore it — then switch to the pathological function and watch the sweet spot collapse:
The polynomial gets to ~10⁻⁸ error at the sweet spot. The pathological function barely hits 10⁻⁵. And the green dashed line at the bottom? Still exact. Still no h.
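The trap is easy to reproduce yourself. A minimal sketch in plain Rust (no tang types): sweep h for f = sin at x = 1 and compare the forward difference against the exact derivative cos(1):

```rust
// Forward-difference derivative of sin at x = 1, compared with cos(1).
fn fd_error(h: f64) -> f64 {
    let x = 1.0_f64;
    let approx = ((x + h).sin() - x.sin()) / h;
    (approx - x.cos()).abs()
}

fn main() {
    for exp in [-2, -4, -8, -12, -15] {
        let h = 10f64.powi(exp);
        println!("h = 1e{exp}  error = {:e}", fd_error(h));
    }
    // The error shrinks until h ≈ 1e-8, then cancellation takes over
    // and it grows again: the V shape from the plot above.
}
```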
The idea has been around since 1964.
A number with a shadow
A dual number is a pair: a value and its derivative. The value is the number you know. The derivative is its shadow, tracking how the number changed to get where it is.
The arithmetic rules are the rules you already know from calculus:
Addition: (a, a') + (b, b') = (a + b, a' + b') Derivatives add.
Multiplication: (a, a') × (b, b') = (a·b, a·b' + a'·b) That's the product rule, straight from calculus.
sin: sin(a, a') = (sin a, a' · cos a) sqrt: √(a, a') = (√a, a' / 2√a) Chain rule. Both of them.
Every operation carries the derivative forward using a rule from first-semester calculus. No limits. No step sizes. No approximation. The derivative just comes along for the ride.
Watch it work
f(x) = x³ − 2x + 1 at x = 3. Seed x = (3, 1), value 3, derivative 1 because ∂x/∂x = 1. Click through and watch the chain rule propagate:
Check: f'(x) = 3x² − 2. At x = 3: 3(9) − 2 = 25. The chain rule did all the work, one node at a time.
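You can replay that walkthrough with nothing but pairs. A minimal sketch, independent of tang's actual types, where a dual number is a bare (value, derivative) tuple:

```rust
// A dual number as a bare (value, derivative) pair.
fn mul(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
    (a.0 * b.0, a.0 * b.1 + a.1 * b.0) // product rule
}
fn add(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
    (a.0 + b.0, a.1 + b.1)
}

fn f(x: (f64, f64)) -> (f64, f64) {
    // f(x) = x³ − 2x + 1, built from the two rules above
    let x3 = mul(mul(x, x), x);
    add(add(x3, mul((-2.0, 0.0), x)), (1.0, 0.0))
}

fn main() {
    let (val, dot) = f((3.0, 1.0)); // seed: ∂x/∂x = 1
    println!("f(3) = {val}, f'(3) = {dot}"); // 22 and 25
}
```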
25 lines of Rust
Here's a complete dual number type:
```rust
#[derive(Clone, Copy)]
struct Dual { val: f64, dot: f64 }

impl std::ops::Add for Dual {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        Dual { val: self.val + rhs.val, dot: self.dot + rhs.dot }
    }
}

impl std::ops::Mul for Dual {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        // product rule: d(ab) = a·db + da·b
        Dual {
            val: self.val * rhs.val,
            dot: self.val * rhs.dot + self.dot * rhs.val,
        }
    }
}

// Sub, Neg: same pattern. (a,a') - (b,b') = (a-b, a'-b')

impl Dual {
    fn sin(self) -> Self {
        Dual { val: self.val.sin(), dot: self.dot * self.val.cos() }
    }
    fn sqrt(self) -> Self {
        let s = self.val.sqrt();
        Dual { val: s, dot: self.dot / (2.0 * s) }
    }
}
```

Any function built from these operations gives you exact derivatives. No tape. No computation graph. No framework. Two floats instead of one.
Here's tang computing x·sin(x) in your browser. The Rust code compiled to WebAssembly, running the same Dual type from the library. Drag the point along the curve and watch the tangent line follow, derivative exact at every position:
Type an expression and watch the dual numbers propagate through every operation. Change x. Try the presets. The derivative is always exact.
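The demo's function is small enough to check by hand. A self-contained sketch in the same shape as the Dual type above, verifying d/dx [x·sin x] = sin x + x·cos x:

```rust
#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl std::ops::Mul for Dual {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        // product rule
        Dual { val: self.val * rhs.val,
               dot: self.val * rhs.dot + self.dot * rhs.val }
    }
}

impl Dual {
    fn var(x: f64) -> Self { Dual { val: x, dot: 1.0 } } // seed ∂x/∂x = 1
    fn sin(self) -> Self {
        // chain rule
        Dual { val: self.val.sin(), dot: self.dot * self.val.cos() }
    }
}

fn main() {
    let x = Dual::var(2.0);
    let y = x * x.sin();                             // f(x) = x·sin x
    let exact = 2.0_f64.sin() + 2.0 * 2.0_f64.cos(); // sin x + x·cos x
    println!("dual: {}, analytic: {}", y.dot, exact);
    assert!((y.dot - exact).abs() < 1e-12);
}
```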
This is forward-mode AD, great when you have few inputs. For millions of parameters (training neural nets), you want reverse-mode. That's backpropagation. Same mathematical insight, the chain rule applied mechanically, but propagating backward through a recorded tape. tang has both: Dual<S> for forward-mode, tang-ad for reverse-mode with grad, jacobian, hessian.
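The tape idea can be sketched in a few dozen lines. This is a toy illustration of reverse-mode, not tang-ad's actual API: record each operation forward, then sweep the tape backward accumulating adjoints.

```rust
// A toy reverse-mode tape: record each op, then sweep backward.
// A sketch of the idea only, not tang-ad's actual API.
#[derive(Clone, Copy)]
enum Op { Input, Add(usize, usize), Mul(usize, usize) }

struct Tape { ops: Vec<Op>, vals: Vec<f64> }

impl Tape {
    fn new() -> Self { Tape { ops: vec![], vals: vec![] } }
    fn input(&mut self, v: f64) -> usize {
        self.ops.push(Op::Input);
        self.vals.push(v);
        self.vals.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.ops.push(Op::Add(a, b));
        self.vals.push(self.vals[a] + self.vals[b]);
        self.vals.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.ops.push(Op::Mul(a, b));
        self.vals.push(self.vals[a] * self.vals[b]);
        self.vals.len() - 1
    }
    /// One backward pass gives ∂output/∂(every recorded node).
    fn grad(&self, out: usize) -> Vec<f64> {
        let mut adj = vec![0.0; self.vals.len()];
        adj[out] = 1.0;
        for i in (0..self.ops.len()).rev() {
            match self.ops[i] {
                Op::Input => {}
                Op::Add(a, b) => { adj[a] += adj[i]; adj[b] += adj[i]; }
                Op::Mul(a, b) => {
                    adj[a] += adj[i] * self.vals[b];
                    adj[b] += adj[i] * self.vals[a];
                }
            }
        }
        adj
    }
}

fn main() {
    // f(x, y) = x·y + x at (3, 4): ∂f/∂x = y + 1 = 5, ∂f/∂y = x = 3
    let mut t = Tape::new();
    let (x, y) = (t.input(3.0), t.input(4.0));
    let xy = t.mul(x, y);
    let f = t.add(xy, x);
    let g = t.grad(f);
    println!("∂f/∂x = {}, ∂f/∂y = {}", g[x], g[y]);
}
```

Note the shape of the tradeoff: one backward pass produced the derivative with respect to every input at once, which is why reverse-mode wins when inputs vastly outnumber outputs.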
The type trick
Here's where it gets interesting.
What if your math library was generic over the number type? Not just f64, any type that implements add, multiply, sin, sqrt. A trait. Call it Scalar:
```rust
fn spring_energy<S: Scalar>(pos: Vec3<S>, rest_len: S) -> S {
    let stretch = pos.norm() - rest_len;
    stretch * stretch * S::from_f64(0.5)
}
```

Call with f64: plain evaluation. Call with Dual<f64>: exact gradient. Same function. Same code. The derivative is free. Just change the type parameter.
I built this. tang. A Rust math library where every type (Vec3, Mat4, Quat, Transform, DMat, DVec, LU, SVD, Cholesky, eigensolvers) is generic over Scalar. 18 crates. ~50,000 lines. 468 tests. no_std throughout, MIT licensed.
Write your physics once. Get exact derivatives by swapping the type.
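The trick compiles in miniature. A toy trait standing in for tang's real Scalar (which covers far more operations), with one generic function called at two number types:

```rust
// A toy version of the type trick. `Num` is a stand-in for tang's
// real Scalar trait, not its actual definition.
trait Num: std::ops::Mul<Output = Self> + std::ops::Sub<Output = Self> + Copy {
    fn from_f64(v: f64) -> Self;
}

impl Num for f64 {
    fn from_f64(v: f64) -> Self { v }
}

#[derive(Clone, Copy)]
struct Dual { val: f64, dot: f64 }

impl std::ops::Mul for Dual {
    type Output = Self;
    fn mul(self, r: Self) -> Self {
        Dual { val: self.val * r.val, dot: self.val * r.dot + self.dot * r.val }
    }
}
impl std::ops::Sub for Dual {
    type Output = Self;
    fn sub(self, r: Self) -> Self {
        Dual { val: self.val - r.val, dot: self.dot - r.dot }
    }
}
impl Num for Dual {
    // constants carry no derivative
    fn from_f64(v: f64) -> Self { Dual { val: v, dot: 0.0 } }
}

/// Written once, generic over the number type: E = ½(x − rest)²
fn energy<S: Num>(x: S, rest: S) -> S {
    let stretch = x - rest;
    stretch * stretch * S::from_f64(0.5)
}

fn main() {
    let e = energy(3.0, 1.0); // plain f64: just the value, 2.0
    let d = energy(Dual { val: 3.0, dot: 1.0 }, Dual::from_f64(1.0)); // seeded in x
    println!("E = {e}, E = {}, dE/dx = {}", d.val, d.dot); // dE/dx = x − rest = 2
}
```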
What does that buy you? Watch two gradient descent particles race down Himmelblau's function. The green particle uses the exact gradient from dual numbers. The red particle uses finite differences. Same learning rate, same start. Drag the h slider to see how step size changes the FD path. Click anywhere to restart from a new position:
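The green particle's gradient can be sketched with the pair arithmetic from earlier. Himmelblau's function is f(x, y) = (x² + y − 11)² + (x + y² − 7)²; two forward passes, seeding x and then y, give the exact gradient:

```rust
// Himmelblau's function on (value, derivative) pairs.
// Two forward passes (seed x, then y) give the exact gradient.
type D = (f64, f64);

fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) }
fn add(a: D, b: D) -> D { (a.0 + b.0, a.1 + b.1) }
fn c(v: f64) -> D { (v, 0.0) } // constant: zero derivative

// f(x, y) = (x² + y − 11)² + (x + y² − 7)²
fn himmelblau(x: D, y: D) -> D {
    let u = add(add(mul(x, x), y), c(-11.0));
    let v = add(add(x, mul(y, y)), c(-7.0));
    add(mul(u, u), mul(v, v))
}

fn grad(x: f64, y: f64) -> (f64, f64) {
    let dx = himmelblau((x, 1.0), (y, 0.0)).1; // seed x
    let dy = himmelblau((x, 0.0), (y, 1.0)).1; // seed y
    (dx, dy)
}

fn main() {
    // (3, 2) is a known minimum of Himmelblau's function: gradient exactly 0
    println!("grad at (3, 2): {:?}", grad(3.0, 2.0));
    println!("grad at (0, 0): {:?}", grad(0.0, 0.0));
}
```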
What you get
tang's decompositions are generic over Scalar. This is the thing nalgebra can't do. Dual<f64> flows through LU, SVD, Cholesky, QR, eigensolvers. You can differentiate through a linear solve:
```rust
let a = DMat::<Dual<f64>>::from_fn(3, 3, |i, j| /* ... */);
let b = DVec::<Dual<f64>>::from_vec(vec![/* ... */]);
let x = Lu::new(a).solve(&b);
// x[i].dot = ∂solution/∂(whatever you seeded)
```

Derivatives of the solution, through the decomposition, exact. No adjoint method by hand. No implicit function theorem. Just change the type.
nalgebra's decompositions are specialized for f64. They cannot accept Dual<f64>. That's not a criticism, it's a different design. But if you need gradients through your solver, you need the generics.
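Differentiating through a solve can be seen in miniature. A hand-rolled 2×2 Cramer solve on (value, derivative) pairs, a sketch of the idea rather than tang's Lu/DMat API, with a parameter θ seeded inside the matrix:

```rust
// Differentiating through a linear solve, in miniature: a hand-rolled
// 2x2 Cramer solve on (value, derivative) pairs. A sketch of the idea,
// not tang's Lu/DMat API.
type D = (f64, f64);

fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) }
fn sub(a: D, b: D) -> D { (a.0 - b.0, a.1 - b.1) }
fn div(a: D, b: D) -> D {
    (a.0 / b.0, (a.1 * b.0 - a.0 * b.1) / (b.0 * b.0)) // quotient rule
}

/// Solve [[a, b], [c, d]] · x = (r0, r1) by Cramer's rule.
fn solve2(a: D, b: D, c: D, d: D, r0: D, r1: D) -> (D, D) {
    let det = sub(mul(a, d), mul(b, c));
    (div(sub(mul(r0, d), mul(b, r1)), det),
     div(sub(mul(a, r1), mul(r0, c)), det))
}

fn main() {
    // A(θ) = [[θ, 1], [1, 2]], b = [1, 0], seeded ∂θ/∂θ = 1 at θ = 1.
    let theta = (1.0, 1.0);
    let one = (1.0, 0.0);
    let two = (2.0, 0.0);
    let zero = (0.0, 0.0);
    let (x0, x1) = solve2(theta, one, one, two, one, zero);
    // Analytically x = (2/(2θ−1), −1/(2θ−1)), so at θ = 1:
    // x = (2, −1) and ∂x/∂θ = (−4, 2). The dual pass agrees exactly.
    println!("x = ({}, {}), dx/dθ = ({}, {})", x0.0, x1.0, x0.1, x1.1);
}
```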
Dual-number LU solve: 81ns. Finite differences for the same derivative: 167ns. 2x faster and exact.
Feel the derivative
A spring. Force F = −k(x − L). Two derivatives come free from the dual numbers: ∂F/∂k tells you how the force changes if you stiffen the spring. ∂F/∂L tells you how it changes if you lengthen the rest position. Drag the sliders and watch the arrows:
These are the sensitivities an optimizer needs. Not "how does the output change if I nudge the input by some small h" but how does it change exactly, analytically, through every operation in the computation.
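Both sensitivities fall out of two seeded passes. A sketch with the bare-pair arithmetic (not tang's Vec3/Scalar machinery): for F = −k(x − L), the analytic answers are ∂F/∂k = −(x − L) and ∂F/∂L = k.

```rust
// Spring force F = −k(x − L) on (value, derivative) pairs.
// Seed k in one pass and L in another to get both sensitivities.
type D = (f64, f64);

fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) }
fn sub(a: D, b: D) -> D { (a.0 - b.0, a.1 - b.1) }
fn neg(a: D) -> D { (-a.0, -a.1) }

fn force(k: D, x: D, l: D) -> D { neg(mul(k, sub(x, l))) }

fn main() {
    let (k, x, l) = (10.0, 1.5, 1.0);
    // ∂F/∂k: seed k with derivative 1, everything else 0
    let dk = force((k, 1.0), (x, 0.0), (l, 0.0)).1;
    // ∂F/∂L: seed L instead
    let dl = force((k, 0.0), (x, 0.0), (l, 1.0)).1;
    println!("∂F/∂k = {dk}  (analytic: −(x−L) = {})", -(x - l));
    println!("∂F/∂L = {dl}  (analytic: k = {k})");
}
```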
18 crates
Two trees converging. Left: algebra (vectors, matrices, decompositions, autodiff, optimization). Right: symbolic computation (expression DAGs, GPU compilation, n-d arrays). They meet at training and distributed compute.
tang-train has 40+ layers from Linear through Transformers with GQA, SwiGLU, RotaryEmbedding, LoRA. Hand-(claude)-written backward passes, no tape overhead. AdamW with state persistence. Learning rate schedulers. A PINN module for physics-informed neural networks.
tang-gpu traces Rust math into symbolic DAGs via tang-expr, differentiates symbolically, and compiles to Metal/Vulkan/DX12 through wgpu. tang-compute writes native Metal and CUDA kernels for the hot paths: flash attention, fused activations, simdgroup matrix acceleration.
tang-mesh ships expression graphs over QUIC, not tensors. Each worker compiles locally to its own backend. Your M3 compiles to Metal, a 4090 compiles to CUDA. Same graph, different hardware. Data-parallel, pipeline-parallel, tensor-parallel.
Honest about speed
tang's dense LA is pure generic Rust. nalgebra dispatches to BLAS. The gap is real:
| Op (n=128) | tang | nalgebra |
|---|---|---|
| GEMM | 250µs | 84µs |
| LU solve | 118µs | 107µs |
tang works with Dual<f64>, intervals, anything implementing Scalar. nalgebra's fast path is f64-only. Different tradeoff. Enable tang's faer feature for Apple Accelerate BLAS when you need it.
For geometry primitives, the stuff that actually runs in inner loops:
| Op | tang | nalgebra | glam (f32) |
|---|---|---|---|
| vec3 dot | 2.2ns | 2.2ns | 1.6ns |
| quat slerp | 8.1ns | 11.0ns | 6.3ns |
| mat4 inverse | 12.0ns | 16.0ns | 8.8ns |
Competitive despite full generics. The Scalar abstraction costs almost nothing at the call site.
tang is the math backbone of gaia, a 7B-parameter world model I'm training for physical design. The simulation that generates labels and the network that learns from them use the same types, the same Scalar trait, the same code. That's the whole point. When the derivative is free, the line between simulation and learning disappears.
```
cargo add tang tang-la tang-ad
```

docs · examples · source
