A Number with a Shadow

hi. let me tell you about a number trick.

You can measure a curve with a ruler. Lay it across two points, read the slope. Zoom in and the ruler shrinks — better approximation. Keep zooming and your hands start shaking. At some point you're measuring the shake, not the curve.

measuring...

That's not a metaphor. That's literally what happens when you compute a derivative with floating point arithmetic:

let dfdx = (f(x + h) - f(x)) / h;

Pick a small h. Evaluate twice. Divide. It worked. Then I made h smaller and it worked better. Then I made h way smaller and it got worse.

 

The step-size trap

Here's sin(x) at x = 1. The green line is the true tangent, slope cos(1) ≈ 0.54, exact. The red line is the secant from finite differences, connecting (x, sin x) and (x+h, sin(x+h)), with slope (sin(x+h) − sin x) / h.

Drag h down. Watch the red line approach the green one. Now keep going past 10⁻⁸. Watch it fall apart.

[interactive: sin(x) with the exact tangent (green) and the finite-difference secant (red); drag h and read off tangent, secant, and error]
the ruler is shorter than the smallest scratch it can see.

This is catastrophic cancellation. At tiny h, f(x + h) and f(x) are nearly identical in floating point. Subtract them and you're left with noise. Divide that noise by h and you amplify it. At h = 10⁻¹⁵ the secant is computing its slope from two points that are the same number in f64.
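You can watch the trap numerically in a few lines. A minimal sketch in plain Rust, using f(x) = x³ − 2x + 1 (the same function the post differentiates later), whose exact derivative at x = 3 is 25:

```rust
fn main() {
    // f'(x) = 3x² − 2, so f'(3) = 25 exactly
    let f = |x: f64| x * x * x - 2.0 * x + 1.0;
    let x = 3.0_f64;
    for &h in &[1e-4, 1e-8, 1e-12, 1e-15] {
        let secant = (f(x + h) - f(x)) / h;
        // error shrinks until cancellation takes over, then blows up
        println!("h = {h:>7.0e}   error = {:.3e}", (secant - 25.0).abs());
    }
}
```

The error falls from roughly 10⁻³ to roughly 10⁻⁷, then climbs back up as h shrinks past the sweet spot.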

There's a sweet spot around h ≈ 10⁻⁸ where truncation error and cancellation error balance out. Here's the V-shaped trap. Drag h to explore it — then switch to the pathological function and watch the sweet spot collapse:

[interactive: log–log plot of |error| against step size h, showing the truncation and cancellation branches meeting at the sweet spot (h ≈ 10⁻⁸ → error ≈ 10⁻⁸), with the flat dual-number line underneath]

The polynomial gets to ~10⁻⁸ error at the sweet spot. The pathological function barely hits 10⁻⁵. And the green dashed line at the bottom? Still exact. Still no h.

The idea has been around since 1964, when R. E. Wengert published a short CACM paper describing exactly this kind of automatic derivative evaluation.

 

 

A number with a shadow

wait. what if the derivative just... came along?

A dual number is a pair: a value and its derivative. The value is the number you know. The derivative is its shadow, tracking how the number changed to get where it is.

The arithmetic rules are the rules you already know from calculus:

Addition: (a, a') + (b, b') = (a + b, a' + b'). Derivatives add.

Multiplication: (a, a') × (b, b') = (a·b, a·b' + a'·b). That's the product rule. Literally the product rule from calc.

sin: sin(a, a') = (sin a, a' · cos a)
sqrt: √(a, a') = (√a, a' / (2√a))
Chain rule. Both of them.

Every operation carries the derivative forward using a rule from first-semester calculus. No limits. No step sizes. No approximation. The derivative just comes along for the ride.
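Before any operator overloading, you can run these rules by hand on bare (value, derivative) pairs. A sketch with hypothetical helper functions (not tang's API):

```rust
type D = (f64, f64); // (value, derivative)

fn add(a: D, b: D) -> D { (a.0 + b.0, a.1 + b.1) }            // derivatives add
fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) } // product rule

fn main() {
    let x: D = (3.0, 1.0);  // seed: value 3, ∂x/∂x = 1
    let x2 = mul(x, x);     // (9, 6): d(x²)/dx at 3 is 6
    let x3 = mul(x2, x);    // (27, 27): d(x³)/dx at 3 is 27
    let s = add(x2, x3);    // (36, 33): d(x² + x³)/dx = 2x + 3x² = 33
    println!("{:?} {:?} {:?}", x2, x3, s);
}
```

Each pair carries its exact derivative; no h anywhere.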

 

Watch it work

f(x) = x³ − 2x + 1 at x = 3. Seed x = (3, 1), value 3, derivative 1 because ∂x/∂x = 1. Click through and watch the chain rule propagate:

[interactive: step through the computation graph of x³ − 2x + 1 with x seeded as (3, 1), watching each node's (value, derivative) pair propagate]

Check: f'(x) = 3x² − 2. At x = 3: 3(9) − 2 = 25. The chain rule did all the work, one node at a time.

 

25 lines of Rust

Here's a complete dual number type:

#[derive(Clone, Copy)]
struct Dual { val: f64, dot: f64 }

impl std::ops::Add for Dual {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        Dual { val: self.val + rhs.val, dot: self.dot + rhs.dot }
    }
}

impl std::ops::Mul for Dual {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        // product rule: d(ab) = a·db + da·b
        Dual {
            val: self.val * rhs.val,
            dot: self.val * rhs.dot + self.dot * rhs.val,
        }
    }
}

// Sub, Neg: same pattern. (a,a')-(b,b') = (a-b, a'-b')

impl Dual {
    // seed an input variable: ∂x/∂x = 1
    fn var(x: f64) -> Self {
        Dual { val: x, dot: 1.0 }
    }
    fn sin(self) -> Self {
        Dual { val: self.val.sin(), dot: self.dot * self.val.cos() }
    }
    fn sqrt(self) -> Self {
        let s = self.val.sqrt();
        Dual { val: s, dot: self.dot / (2.0 * s) }
    }
}

Any function built from these operations gives you exact derivatives. No tape. No computation graph. No framework. Two floats instead of one.
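To check the worked example from earlier, here's the type driving f(x) = x³ − 2x + 1 at x = 3 end to end. The Dual type is repeated so the sketch compiles standalone, Sub is filled in as the comment above describes, and `con` is a hypothetical constructor name for constants (derivative 0):

```rust
use std::ops::{Add, Mul, Sub};

#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl Dual {
    fn var(x: f64) -> Self { Dual { val: x, dot: 1.0 } } // seed: ∂x/∂x = 1
    fn con(c: f64) -> Self { Dual { val: c, dot: 0.0 } } // constant: derivative 0
}
impl Add for Dual {
    type Output = Self;
    fn add(self, r: Self) -> Self { Dual { val: self.val + r.val, dot: self.dot + r.dot } }
}
impl Sub for Dual {
    type Output = Self;
    fn sub(self, r: Self) -> Self { Dual { val: self.val - r.val, dot: self.dot - r.dot } }
}
impl Mul for Dual {
    type Output = Self;
    // product rule: d(ab) = a·db + da·b
    fn mul(self, r: Self) -> Self {
        Dual { val: self.val * r.val, dot: self.val * r.dot + self.dot * r.val }
    }
}

fn main() {
    let x = Dual::var(3.0);
    let y = x * x * x - Dual::con(2.0) * x + Dual::con(1.0); // f(x) = x³ − 2x + 1
    println!("f(3) = {}, f'(3) = {}", y.val, y.dot);         // 22 and 25
}
```

The 25 pops out of the dot field, matching 3x² − 2 at x = 3 exactly.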

Here's tang computing x·sin(x) in your browser. The Rust code is compiled to WebAssembly and runs the same Dual type from the library. Drag the point along the curve and watch the tangent line follow, derivative exact at every position:

tang running in your browser via WebAssembly
[interactive: x·sin(x) with a draggable point; x, f(x), and f'(x) readouts update live]
let d = Dual::var(x);
let y = d * d.sin();
// y.val = f(x), y.dot = f'(x)
that's real Rust, running in your browser right now.

Type an expression and watch the dual numbers propagate through every operation. Change x. Try the presets. The derivative is always exact.

For the preset f(x) = sin(x · x) at x = 2, the trace reads:

expression      value     derivative   rule
x               2         1            seed variable
(x · x)         4         4            product rule
sin((x · x))    −0.7568   −2.6146      chain rule (sin)

f(2) = −0.7568,  f′(2) = −2.6146

This is forward-mode AD, great when you have few inputs. For millions of parameters (training neural nets), you want reverse-mode. That's backpropagation. Same mathematical insight, the chain rule applied mechanically, but propagating backward through a recorded tape. tang has both: Dual<S> for forward-mode, tang-ad for reverse-mode with grad, jacobian, hessian.
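What "few inputs" means concretely: forward mode gives you one directional derivative per pass, so a full gradient of f(x, y) takes two passes, one seed per input. A sketch on bare pairs (hypothetical helpers, not tang's API):

```rust
type D = (f64, f64); // (value, derivative along the seeded input)

fn add(a: D, b: D) -> D { (a.0 + b.0, a.1 + b.1) }
fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) } // product rule
fn sin(a: D) -> D { (a.0.sin(), a.1 * a.0.cos()) }             // chain rule

// f(x, y) = x·y + sin(x); analytically ∂f/∂x = y + cos(x), ∂f/∂y = x
fn f(x: D, y: D) -> D { add(mul(x, y), sin(x)) }

fn main() {
    let (x, y) = (2.0, 5.0);
    let dfdx = f((x, 1.0), (y, 0.0)).1; // pass 1: seed x
    let dfdy = f((x, 0.0), (y, 1.0)).1; // pass 2: seed y
    println!("∂f/∂x = {dfdx}, ∂f/∂y = {dfdy}"); // y + cos(x) ≈ 4.58, and x = 2
}
```

Two inputs, two passes. A million inputs, a million passes — which is why reverse mode exists.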

 

 

The type trick

Here's where it gets interesting.

What if your math library was generic over the number type? Not just f64, any type that implements add, multiply, sin, sqrt. A trait. Call it Scalar:

fn spring_energy<S: Scalar>(pos: Vec3<S>, rest_len: S) -> S {
    let stretch = pos.norm() - rest_len;
    stretch * stretch * S::from_f64(0.5)
}

Call with f64: plain evaluation. Call with Dual<f64>: exact gradient. Same function. Same code. The derivative is free. Just change the type parameter.
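A pared-down sketch of the trick. This Scalar is a toy with just the ops the example needs (tang's real trait has far more), and the spring is 1-D instead of Vec3 to keep it short:

```rust
use std::ops::{Add, Mul, Sub};

// Toy trait: just enough surface for the example.
trait Scalar: Copy + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> {
    fn from_f64(x: f64) -> Self;
}

impl Scalar for f64 {
    fn from_f64(x: f64) -> Self { x }
}

#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl Add for Dual {
    type Output = Self;
    fn add(self, r: Self) -> Self { Dual { val: self.val + r.val, dot: self.dot + r.dot } }
}
impl Sub for Dual {
    type Output = Self;
    fn sub(self, r: Self) -> Self { Dual { val: self.val - r.val, dot: self.dot - r.dot } }
}
impl Mul for Dual {
    type Output = Self;
    fn mul(self, r: Self) -> Self {
        Dual { val: self.val * r.val, dot: self.val * r.dot + self.dot * r.val }
    }
}
impl Scalar for Dual {
    fn from_f64(x: f64) -> Self { Dual { val: x, dot: 0.0 } }
}

// Written once, generic over the number type.
// 1-D "spring energy": E = ½(x − rest)²  →  dE/dx = x − rest
fn spring_energy<S: Scalar>(x: S, rest: S) -> S {
    let stretch = x - rest;
    stretch * stretch * S::from_f64(0.5)
}

fn main() {
    // plain evaluation
    let e = spring_energy(3.0_f64, 1.0);
    // same function, Dual in: derivative comes out
    let d = spring_energy(Dual { val: 3.0, dot: 1.0 }, Dual::from_f64(1.0));
    println!("E = {e}, E = {}, dE/dx = {}", d.val, d.dot); // 2, 2, 2
}
```

Same function body both times; only the type parameter changed.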

I built this. tang. A Rust math library where every type (Vec3, Mat4, Quat, Transform, DMat, DVec, LU, SVD, Cholesky, eigensolvers) is generic over Scalar. 18 crates. ~50,000 lines. 468 tests. no_std throughout, MIT licensed.

Write your physics once. Get exact derivatives by swapping the type.

go green go! ...wait, they both started at the same place.

What does that buy you? Watch two gradient descent particles race down Himmelblau's function. The green particle uses the exact gradient from dual numbers. The red particle uses finite differences. Same learning rate, same start. Drag the h slider to see how step size changes the FD path. Click anywhere to restart from a new position:

[interactive: gradient descent race on Himmelblau's function — green uses the exact dual-number gradient, red uses finite differences; drag h to change the FD path, click to restart from a new position]

 

 

What you get

tang's decompositions are generic over Scalar. This is the thing nalgebra can't do. Dual<f64> flows through LU, SVD, Cholesky, QR, eigensolvers. You can differentiate through a linear solve:

let a = DMat::<Dual<f64>>::from_fn(3, 3, |i, j| /* ... */);
let b = DVec::<Dual<f64>>::from_vec(vec![/* ... */]);
let x = Lu::new(a).solve(&b);
// x[i].dot = ∂solution/∂(whatever you seeded)

Derivatives of the solution, through the decomposition, exact. No adjoint method by hand. No implicit function theorem. Just change the type.
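The 1×1 case shows the mechanism. Solving a·x = b with a dual coefficient makes the quotient rule hand you ∂x/∂a = −b/a² automatically — the same thing happening dimension by dimension inside an LU solve. A standalone sketch, not tang's API:

```rust
use std::ops::Div;

#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, dot: f64 }

impl Div for Dual {
    type Output = Self;
    // quotient rule: d(b/a) = (b'·a − b·a') / a²
    fn div(self, r: Self) -> Self {
        Dual {
            val: self.val / r.val,
            dot: (self.dot * r.val - self.val * r.dot) / (r.val * r.val),
        }
    }
}

fn main() {
    // solve a·x = b with a seeded; analytically ∂x/∂a = −b/a²
    let a = Dual { val: 4.0, dot: 1.0 }; // seed ∂/∂a
    let b = Dual { val: 8.0, dot: 0.0 };
    let x = b / a;
    println!("x = {}, ∂x/∂a = {}", x.val, x.dot); // 2, −0.5 (= −8/16)
}
```

No implicit function theorem on paper; the derivative of the solve falls out of the arithmetic.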

nalgebra's decompositions are specialized for f64. They cannot accept Dual<f64>. That's not a criticism, it's a different design. But if you need gradients through your solver, you need the generics.

Dual-number LU solve: 81ns. Finite differences for the same derivative: 167ns. 2x faster and exact.

 

Feel the derivative

A spring. Force F = −k(x − L). Two derivatives come free from the dual numbers: ∂F/∂k tells you how the force changes if you stiffen the spring. ∂F/∂L tells you how it changes if you lengthen the rest position. Drag the sliders and watch the arrows:

[interactive: a spring with sliders for k and L; readouts for F, ∂F/∂k, and ∂F/∂L]
∂F/∂k says: if you stiffen the spring, I bounce harder ;)

These are the sensitivities an optimizer needs. Not "how does the output change if I nudge the input by some small h" but how does it change exactly, analytically, through every operation in the computation.
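In code, the two arrows are two seeded passes through the same force function. A sketch on bare pairs (hypothetical helpers, not tang's API):

```rust
type D = (f64, f64); // (value, derivative along the seeded input)

fn mul(a: D, b: D) -> D { (a.0 * b.0, a.0 * b.1 + a.1 * b.0) }
fn sub(a: D, b: D) -> D { (a.0 - b.0, a.1 - b.1) }
fn neg(a: D) -> D { (-a.0, -a.1) }

// F = −k(x − L); analytically ∂F/∂k = −(x − L), ∂F/∂L = k
fn force(k: D, x: D, l: D) -> D { neg(mul(k, sub(x, l))) }

fn main() {
    let (k, x, l) = (3.0, 5.0, 2.0);
    let dfdk = force((k, 1.0), (x, 0.0), (l, 0.0)).1; // seed k: −(x − L) = −3
    let dfdl = force((k, 0.0), (x, 0.0), (l, 1.0)).1; // seed L: k = 3
    println!("∂F/∂k = {dfdk}, ∂F/∂L = {dfdl}");
}
```

Both sensitivities come from the one force expression, exact.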

 

 

18 crates

[interactive crate map: tang · tang-la · tang-sparse · tang-ad · tang-optim · tang-train · tang-expr · tang-gpu · tang-compute · tang-tensor · tang-mesh · tang-safetensors · tang-hub — the algebra tree on one side, the symbolic tree on the other]
Two trees converging. Left: algebra (vectors, matrices, decompositions, autodiff, optimization). Right: symbolic computation (expression DAGs, GPU compilation, n-d arrays). They meet at training and distributed compute.

tang-train has 40+ layers from Linear through Transformers with GQA, SwiGLU, RotaryEmbedding, LoRA. Hand-(claude)-written backward passes, no tape overhead. AdamW with state persistence. Learning rate schedulers. A PINN module for physics-informed neural networks.

tang-gpu traces Rust math into symbolic DAGs via tang-expr, differentiates symbolically, and compiles to Metal/Vulkan/DX12 through wgpu. tang-compute writes native Metal and CUDA kernels for the hot paths: flash attention, fused activations, simdgroup matrix acceleration.

tang-mesh ships expression graphs over QUIC, not tensors. Each worker compiles locally to its own backend. Your M3 compiles to Metal, a 4090 compiles to CUDA. Same graph, different hardware. Data-parallel, pipeline-parallel, tensor-parallel.

Honest about speed

tang's dense LA is pure generic Rust. nalgebra dispatches to BLAS. The gap is real:

Op (n=128)   tang     nalgebra
GEMM         250µs    84µs
LU solve     118µs    107µs

tang works with Dual<f64>, intervals, anything implementing Scalar. nalgebra's fast path is f64-only. Different tradeoff. Enable tang's faer feature for Apple Accelerate BLAS when you need it.

For geometry primitives, the stuff that actually runs in inner loops:

Op             tang     nalgebra   glam (f32)
vec3 dot       2.2ns    2.2ns      1.6ns
quat slerp     8.1ns    11.0ns     6.3ns
mat4 inverse   12.0ns   16.0ns     8.8ns

Competitive despite full generics. The Scalar abstraction costs almost nothing at the call site.

 

 

no more picking h.

tang is the math backbone of gaia, a 7B-parameter world model I'm training for physical design. The simulation that generates labels and the network that learns from them use the same types, the same Scalar trait, the same code. That's the whole point. When the derivative is free, the line between simulation and learning disappears.

cargo add tang tang-la tang-ad
tang.toys

docs · examples · source