_ Considered Harmful
February 21, 2026

This match block ran my CAD kernel. One arm handled every surface pair the kernel didn't have an analytic solution for. It was supposed to be a rare fallback. It wasn't rare. It was 82% of all CPU time.
Dragging a slider took 32 seconds to update (given the right features). The fix was 22 lines: 183x faster! But I had to stop guessing first.
The setup
vcad is a parametric CAD app I'm building. The geometry kernel is written in Rust, compiled to WebAssembly, and runs on a Web Worker so the UI stays responsive. When you drag a slider, the kernel re-evaluates the entire model: sweep operations, boolean clash detection, tessellation, serialization back to JavaScript.
I'd already fixed a 15-second main thread freeze by moving evaluation to a worker. The UI was smooth now. But the kernel itself was taking ~600ms per evaluation, which meant parameter sliders felt laggy. And when two parts overlapped (triggering clash detection), it got much worse.
How much worse? I didn't know. The entire Rust kernel had zero timing instrumentation. No visibility at all.
How a CAD kernel represents shapes
Before diving into the optimization, some context on how the kernel works.
CAD kernels don't store triangle meshes. They store exact mathematical surfaces: a cylinder is an origin, axis, and radius. A plane is a point and a normal. A B-Rep solid is a collection of faces, each backed by one of these surfaces, stitched together by edges and vertices into a watertight shell.
I wrote about building the kernel from scratch, and how surfaces map to UV space for tessellation.
This means operations like "where do these two solids intersect?" reduce to "where do their surfaces intersect?" A plane intersecting a cylinder is an ellipse. Closed-form math, O(1). The kernel has analytic solutions for common pairs: plane-plane, plane-cylinder, plane-sphere, and so on.
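To make "closed-form" concrete, here's a hedged sketch of the plane-sphere case. This is my own reconstruction, not vcad's actual API: the signed distance d from the sphere center to the plane determines everything, and if |d| < r the intersection is a circle of radius sqrt(r² − d²).

```rust
/// Sketch of an analytic plane-sphere intersection test (names and
/// signatures are illustrative, not vcad's real code). The plane is
/// given by a point and a unit normal; the sphere by center and radius.
fn plane_sphere_circle_radius(
    plane_point: [f64; 3],
    plane_normal: [f64; 3], // assumed unit length
    center: [f64; 3],
    r: f64,
) -> Option<f64> {
    // Signed distance from sphere center to the plane.
    let d: f64 = (0..3)
        .map(|i| (center[i] - plane_point[i]) * plane_normal[i])
        .sum();
    if d.abs() < r {
        // Intersection is a circle of this radius, centered at the
        // projection of the sphere center onto the plane.
        Some((r * r - d * d).sqrt())
    } else {
        None // tangent or disjoint
    }
}

fn main() {
    // Unit sphere at the origin, plane z = 0.6: circle of radius 0.8.
    let rad = plane_sphere_circle_radius([0.0, 0.0, 0.6], [0.0, 0.0, 1.0], [0.0; 3], 1.0).unwrap();
    assert!((rad - 0.8).abs() < 1e-12);
}
```

A handful of multiplies and one square root, regardless of how finely the surfaces would otherwise need to be sampled.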
But not every surface pair has an analytic solution. For exotic pairs (say, two NURBS surfaces), the kernel falls back to numerical sampling: evaluate a grid of points on surface A, search for closest matches on surface B. Slow but general. This fallback is called SSI, and it's meant to be a rare last resort.
The question this post answers: what happens when it's not rare?
Step 0: see before you touch
I added a Clock trait to the evaluator. Platform-agnostic (same code runs in the CLI), with a WASM implementation that calls performance.now(). Now every evaluation prints a timing breakdown:
[ENGINE] total:614ms | sweep:312ms tess:89ms clash:145ms
Translate#323:412ms > Sweep#320:312ms > Rotate#315:52ms > Scale#310:48ms
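The Clock abstraction behind that output is small. A sketch of the shape it might take (names are my guesses, not vcad's actual API) — the evaluator only ever sees the trait, so the CLI and the WASM build can plug in different backends:

```rust
use std::time::Instant;

/// Platform-agnostic clock: the evaluator depends only on this trait.
trait Clock {
    fn now_ms(&self) -> f64;
}

/// Native/CLI implementation backed by std::time::Instant.
struct StdClock {
    start: Instant,
}

impl Clock for StdClock {
    fn now_ms(&self) -> f64 {
        self.start.elapsed().as_secs_f64() * 1000.0
    }
}

// On wasm32, the same trait would instead be implemented by a type
// that calls performance.now() through wasm-bindgen.

fn main() {
    let clock = StdClock { start: Instant::now() };
    let t0 = clock.now_ms();
    // ... evaluate the model ...
    let elapsed = clock.now_ms() - t0;
    assert!(elapsed >= 0.0);
    println!("[ENGINE] total:{elapsed:.0}ms");
}
```

The point of the indirection is that timing code compiles everywhere; only the clock source is platform-specific.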
Immediately useful. Three things jumped out:
- Translate, Rotate, and Scale nodes were each cloning the entire B-Rep solid. Three copies for what should be one matrix multiply.
- Sweep was the biggest single operation at 312ms.
- Clash detection was 145ms but I suspected it got worse with overlapping parts.
The easy win: transform fusion
The evaluator processed transform operations one at a time. A Translate node would clone the solid and translate it. Then a Rotate node would clone that and rotate it. Then Scale would clone that and scale it. Three full B-Rep clones.
But Translate, Rotate, and Scale are all just 4x4 matrices. They compose. So I wrote a function that walks up the DAG collecting consecutive transforms and composes them into one:
fn collect_transform_chain(op: &CsgOp, nodes: &HashMap<NodeId, Node>) -> (Transform, NodeId) {
    let mut composed = Transform::identity();
    let mut current_op = op;
    loop {
        match current_op {
            CsgOp::Translate { child, offset } => {
                composed = composed.then(&Transform::translation(offset.x, offset.y, offset.z));
                // If the child is also a transform, keep composing
                match nodes.get(child) {
                    Some(node) if is_transform_op(&node.op) => current_op = &node.op,
                    _ => return (composed, *child),
                }
            }
            // Same for Rotate, Scale...
        }
    }
}

One clone instead of three. The Rotate and Scale nodes vanished from the timing output entirely. ~150ms saved.
FxHash: the constant-factor win
I wanted to see actual Rust function names in the browser profiler. Safari's Web Inspector can profile Web Workers, but by default wasm-opt strips all the debug info from the binary. Two lines in Cargo.toml fixed that:
[package.metadata.wasm-pack.profile.profiling]
wasm-opt = ['-O', '--debuginfo']
Now wasm-pack build --profiling produces a binary with function names intact. I hit record, dragged the slider, and got a real flame graph with Rust function names.
The sweep operation's hot loop builds a half-edge map keyed by quantized [i64; 3] coordinates. The profiler showed hash_one and DefaultHasher::write at 17% of sweep time. Rust's default hasher is SipHash, which is great for hash-flooding resistance and terrible for integer keys where you don't need that.
// Before: SipHash (cryptographic, slow for integers)
let mut he_map: HashMap<([i64; 3], [i64; 3]), HalfEdgeId> = HashMap::new();

// After: FxHash (multiplicative, fast for integers)
let mut he_map: FxHashMap<([i64; 3], [i64; 3]), HalfEdgeId> = FxHashMap::default();

Hash overhead dropped from 17% to 8%. Steady-state eval went from ~600ms to ~165ms. Getting somewhere.
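For reference, FxHashMap here is the type alias exported by the rustc-hash crate (the non-cryptographic hasher rustc itself uses internally). Wiring it in is one dependency line plus an import:

```toml
# Cargo.toml
[dependencies]
rustc-hash = "2"
```

Then `use rustc_hash::FxHashMap;` and swap the type. The trade-off is deliberate: FxHash gives up hash-flooding resistance, which doesn't matter for internally generated quantized coordinates.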
The wall
Then I captured a trace during clash detection with two overlapping parts. 32 seconds.
Here's what the profiler showed:
| Function | Self-time | % |
|---|---|---|
| marching_ssi | 26,462ms | 82.0% |
| classify_face | 1,048ms | 3.2% |
| sweep | 773ms | 2.4% |
| memcmp | 431ms | 1.3% |
| everything else | 3,545ms | 11.1% |
One function. 82% of all CPU time. Not a hot loop. Not a thousand small things. One function eating 26 seconds.
What is marching_ssi
When two solids overlap, the boolean pipeline needs to find where their surfaces intersect. For common surface pairs (plane-plane, plane-cylinder, plane-sphere) there are closed-form analytic solutions. Plane meets cylinder? That's a circle, or two lines, or an ellipse. O(1). Done.
But when the kernel doesn't have an analytic solution for a surface pair, it falls back to marching_ssi: a brute-force sampling method. It evaluates an n x n grid on surface A, and for each point, searches a 16 x 16 grid on surface B for the closest point. Total work per face pair: n^2 x (16^2 + 1) surface evaluations — the n^2 samples on A, plus a 256-point search on B for each of them.
With n=32 (the default), that's 32^2 x 257 = 263,168 surface evaluations per face pair.
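As a sanity check on that arithmetic:

```rust
/// Surface evaluations marching_ssi does per face pair: n*n samples
/// on surface A, each searching a 16x16 grid on surface B, plus the
/// n*n evaluations of A itself.
fn ssi_evals(n: u64) -> u64 {
    n * n * (16 * 16 + 1)
}

fn main() {
    assert_eq!(ssi_evals(32), 263_168);
    // Even the coarser fallback grid is expensive:
    assert_eq!(ssi_evals(16), 65_792);
}
```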
The kernel had analytic SSI for planes, cylinders, spheres, cones, and tori. Here's the dispatch:
match (a.surface_type(), b.surface_type()) {
    (Plane, Plane) => plane_plane(a, b),       // O(1)
    (Plane, Cylinder) => plane_cylinder(a, b), // O(1)
    (Plane, Sphere) => plane_sphere(a, b),     // O(1)
    (Sphere, Sphere) => sphere_sphere(a, b),   // O(1)
    // ... more analytic pairs ...
    _ => marching_ssi(a, b, 32),               // O(n^2) !!!!
}

See that _ arm? That's the fallback. Any surface type not explicitly handled falls through to brute-force sampling.
The missing arm
Sweep operations produce faces using BilinearSurface, a bilinearly interpolated quad. It's basically a plane with a slight twist. The kernel had an is_planar() method that checks if all four corners are coplanar (they almost always are for sweep faces).
But BilinearSurface was not in the SSI dispatch table. It fell through to _. Every single face pair involving a swept surface went through marching_ssi.
A swept solid with 32 path segments and 20 profile vertices has ~640 BilinearSurface faces. Clash detection between two such solids tests hundreds of overlapping face pairs. Each one doing 263K surface evaluations. That's where the 32 seconds came from.
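A coplanarity test like is_planar() presumably boils down to a scalar triple product — the fourth corner lies in the plane of the first three exactly when the triple product of the edge vectors vanishes. A hedged sketch (my own reconstruction, not vcad's code):

```rust
/// Illustrative coplanarity check for a bilinear quad's four corners.
/// Names and the epsilon are assumptions, not vcad's actual values.
fn is_planar(p00: [f64; 3], p10: [f64; 3], p01: [f64; 3], p11: [f64; 3], eps: f64) -> bool {
    let sub = |a: [f64; 3], b: [f64; 3]| [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
    let e1 = sub(p10, p00);
    let e2 = sub(p01, p00);
    let e3 = sub(p11, p00);
    // Normal of the plane through p00, p10, p01.
    let n = [
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    ];
    // Signed volume spanned by the three edges: zero iff coplanar.
    let triple = n[0] * e3[0] + n[1] * e3[1] + n[2] * e3[2];
    triple.abs() < eps
}

fn main() {
    // A flat quad in z = 0 is planar; lifting one corner breaks it.
    assert!(is_planar([0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.], 1e-12));
    assert!(!is_planar([0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.5], 1e-12));
}
```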
The fix was to approximate each BilinearSurface as a plane and re-dispatch:
impl BilinearSurface {
    pub fn to_approximate_plane(&self) -> Option<Plane> {
        let centroid = Point3::new(
            0.25 * (self.p00.x + self.p10.x + self.p01.x + self.p11.x),
            0.25 * (self.p00.y + self.p10.y + self.p01.y + self.p11.y),
            0.25 * (self.p00.z + self.p10.z + self.p01.z + self.p11.z),
        );
        let e1 = self.p10 - self.p00;
        let e2 = self.p01 - self.p00;
        let n = e1.cross(&e2);
        if n.norm() < 1e-15 { return None; }
        let normal = Dir3::new_normalize(n);
        let x_dir = e1.normalize();
        let y_dir = normal.as_ref().cross(&x_dir);
        Some(Plane::new(centroid, x_dir, y_dir))
    }
}

And in the dispatcher:
(SurfaceKind::Bilinear, _) => {
    if let Some(plane) = downcast_bilinear(a).and_then(|bl| bl.to_approximate_plane()) {
        intersect_surfaces(&plane, b) // hits Plane x Plane, Plane x Cylinder, etc.
    } else {
        marching_ssi(a, b, 16)
    }
}

Now every BilinearSurface face pair resolves through analytic O(1) intersection. marching_ssi disappeared from the profile entirely.
The result
| Metric | Before | After |
|---|---|---|
| Steady-state eval | 600ms | 100ms |
| Clash detection | 32,000ms | 175ms |
| marching_ssi share | 82% | 0% |
After the fix, the profiler shows a well-distributed breakdown with no single bottleneck:
| Component | % |
|---|---|
| Sweep construction | 14.1% |
| Face classification | 10.3% |
| memcmp | 8.4% |
| Tessellation | 7.5% |
| Memory allocation | 8.9% |
That's what a healthy profile looks like. No function above 15%. Lots of small things. Hard to speed up further without algorithmic changes to individual components.
The lesson is always the same
I wasted time before I had data. I tried adding an AABB bounding box early-exit to clash detection. It made things worse (the bounding box computation itself tessellated non-planar surfaces, which is the thing I was trying to avoid). I tried lowering tessellation quality. Marginal improvement. I was guessing.
Then I opened the profiler and the answer was staring at me in 72-point font. 82%. One function. The fix was 22 lines.
Here's the breakdown of what each optimization contributed: transform fusion cut roughly 150ms of cloning, FxHash brought steady-state eval down to ~165ms, and the analytic BilinearSurface dispatch took clash detection from 32 seconds to 175ms (and steady-state to ~100ms).
The constant-factor stuff was nice. The algorithmic fix was everything. No amount of micro-optimization, no SIMD, no parallelism, no clever memory layout would have come close to "stop doing 263,000 evaluations when you could do 1."
If your program is slow and you haven't profiled it, you're writing fiction. Profile first. The answer is usually one function, one match arm, one missing case that sends your fast path into your slow path. Find it, fix it, go home.
A language design aside
This bug made me think about _ differently.
Both Rust and Loon (a language I'm building) have exhaustiveness checking. If you match on an enum and miss a variant, the compiler tells you. But _ opts out. It's a blanket catch-all, and the compiler stops caring what it catches.
When I added BilinearSurface to the kernel, Rust was happy. The _ arm covered it. No warning, no error. The new variant silently fell into a code path that does 263,168 evaluations instead of 1.
So I made Loon do something Rust doesn't. As of this commit, when you write _ on a known type, the compiler tells you exactly which constructors it catches:
W0100: _ catches 2 constructors of SurfaceKind: BSpline, Bilinear
why: adding a variant to `SurfaceKind` will silently fall into this arm
fix: add explicit arms for each constructor, or keep _ if intentional
When you add Bilinear to the type, the warning changes from "catches: BSpline" to "catches: BSpline, Bilinear." You notice. You think about whether the fallback makes sense for the new variant. The _ arm is still convenient, but it's transparent instead of silent.
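Mechanically, producing that warning is cheap once the checker knows the full constructor set: the wildcard catches exactly the set difference between all constructors and the explicitly matched ones. A hypothetical sketch of that computation (not Loon's actual compiler code):

```rust
use std::collections::BTreeSet;

/// What a W0100-style diagnostic computes: given every constructor of
/// the matched type and the ones with explicit arms, the `_` wildcard
/// silently catches the difference.
fn wildcard_catches<'a>(
    all: &BTreeSet<&'a str>,
    explicit: &BTreeSet<&'a str>,
) -> Vec<&'a str> {
    all.difference(explicit).copied().collect()
}

fn main() {
    let all: BTreeSet<_> =
        ["Plane", "Cylinder", "Sphere", "Cone", "Torus", "BSpline", "Bilinear"].into();
    let explicit: BTreeSet<_> = ["Plane", "Cylinder", "Sphere", "Cone", "Torus"].into();

    let caught = wildcard_catches(&all, &explicit);
    assert_eq!(caught, vec!["BSpline", "Bilinear"]);
    println!(
        "W0100: _ catches {} constructors of SurfaceKind: {}",
        caught.len(),
        caught.join(", ")
    );
}
```

Adding a variant to the type grows the caught set, so the warning text changes and a diff review catches it — which is the whole point.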
Rust's #[non_exhaustive] does something similar for cross-crate enums. But it doesn't help within your own crate, where most match bugs live. Loon's exhaustiveness checker already tracks covered constructors. Surfacing what _ catches was a small addition to the checker, and it would have caught this exact bug.
The full optimization work happened across a few sessions, all with Claude. Transform fusion, FxHash, WASM profiling setup, the BilinearSurface insight, a tessellation caching fix in the boolean pipeline. It felt less like "pair programming" and more like having someone who could hold the whole flame graph in their head while I pasted Safari traces.
vcad is open source if you want to poke around. Sliders are snappy now.
