Hung by a Thread

Late night debugging session with robot on workbench

It's 2am. My robot is frozen. Not crashed, not erroring, just... vibing. Sitting there. Motors off. Completely checked out.

I've been debugging for 8 hours and I'm about to mass delete my entire codebase and become a farmer.

The Setup

I'm building autonomous sidewalk robots. The control loop runs at 100Hz — every 10ms we read sensors, do math, send motor commands. It's the heartbeat. The one thing that absolutely cannot stop.
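If you haven't built one of these: it's just a fixed-period loop. Here's a minimal sketch of the shape, with made-up stand-ins for the real sensor and motor code:

use std::time::{Duration, Instant};

// Hypothetical stand-ins for the real sensor and motor plumbing.
struct SensorFrame;
struct MotorCommand;
fn read_sensors() -> SensorFrame { SensorFrame }
fn compute_command(_f: &SensorFrame) -> MotorCommand { MotorCommand }
fn send_motor_command(_c: MotorCommand) {}

fn control_loop() {
    // 100Hz: one read -> compute -> command pass every 10ms.
    let period = Duration::from_millis(10);
    let mut next = Instant::now();
    loop {
        let frame = read_sensors();
        let cmd = compute_command(&frame);
        send_motor_command(cmd);

        // Sleep out whatever is left of the 10ms budget; resync if we overran it.
        next += period;
        match next.checked_duration_since(Instant::now()) {
            Some(remaining) => std::thread::sleep(remaining),
            None => next = Instant::now(),
        }
    }
}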

It had been rock solid for weeks. Then I added LiDAR streaming over WebRTC.

Now, ~16 seconds after a client connects, the loop just stops. Doesn't crash. Doesn't throw. Just ghosts me. The watchdog starts barking, the robot coasts to a stop, and my laptop shows a beautiful 3D point cloud of a robot that has given up on life.

Animation: the iteration counter in control_loop.rs

The Wrong Turns

I tried everything.

"It's tokio starving the loop" — switched to std::thread::sleep. Nope.

"It's the async mutex" — swapped for std::sync::Mutex. Nope.

"It's running on the wrong thread" — moved the whole loop to std::thread::spawn. Complete isolation. Nope nope nope.

Same freeze. Same spot. Iteration 1,615. Every single time.

The consistency was almost insulting. Like the bug was laughing at me.

The Breakthrough

Ok new plan. I add a heartbeat thread. Just a lil guy that watches a counter and screams if it stops:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Shared iteration counter: the control loop bumps it once per tick
// with tick.fetch_add(1, Ordering::Relaxed).
let counter = Arc::new(AtomicU64::new(0));
let tick = Arc::clone(&counter); // hand this clone to the control loop

std::thread::spawn(move || {
    let mut last = 0;
    loop {
        std::thread::sleep(Duration::from_secs(5));
        let current = counter.load(Ordering::Relaxed);
        if current == last {
            eprintln!("STUCK at iteration {}", current);
        }
        last = current;
    }
});

Five seconds after the freeze: STUCK at iteration 1615

Oh. OH. It's not slow. It's not starved. It's blocked. Something is holding a lock and simply not letting go. Deadlock behavior.

Time to bring out the big guns. GDB.

I attach GDB, dump every thread's backtrace, and there she is: the control loop thread, parked on a mutex. But who's holding it??

I scroll through the other threads. Tokio workers, GStreamer stuff, and then... four threads I definitely did not create. Rayon workers. I don't use rayon. Who invited rayon.

The Reveal

Rerun is this beautiful visualization SDK I use for recording telemetry. You call recorder.log() and magic happens.

Turns out rerun uses rayon internally.

And I was calling recorder.log() while holding a mutex.

Diagram: the control loop thread (T17) acquiring the mutex, next to rerun's rayon thread pool (R1-R4)

This is a known rayon footgun: rayon#592. When you call into rayon while holding a mutex, rayon's work-stealing threads can deadlock trying to "help" with work that needs the lock you're already holding.

// Holding mutex while calling rerun
let mut state = shared.lock().unwrap();

// ... doing work ...

if let Some(scan) = &*lidar_rx.borrow() {
    recorder.log_lidar_scan(scan); // DEADLOCK
}

That's it. That's the whole bug. 8 hours of debugging, 2 lines changed: hold the lock for less time. Tale as old as time.
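For the curious, the shape of the fix looks roughly like this. It's a sketch using the same names as the snippet above, not the exact diff; the idea is just to end the guard's scope before touching rerun:

// Fixed: the lock is released before we call into rerun
{
    let mut state = shared.lock().unwrap();

    // ... doing work ...

} // MutexGuard dropped here, lock released

if let Some(scan) = &*lidar_rx.borrow() {
    recorder.log_lidar_scan(scan); // no lock held, rayon is free to do its thing
}

An explicit drop(state) before the log call works just as well. The only thing that matters is that no MutexGuard is alive when rerun (and therefore rayon) gets called.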

The Takeaways

  1. GDB is cracked for deadlocks. Logs can't show you thread state. thread apply all bt hits different.

  2. Random threads showing up? Suspicious. If you see thread pools you didn't spawn, figure out who did.

  3. Your dependencies have dependencies. Somewhere in that Cargo.lock is a threading model waiting to fight yours.

  4. Heartbeat threads are free. A few lines to detect "stuck" is worth it for any critical loop. They're just a lil guy. Let them watch.

  5. The fix is always smaller than the hunt. Always. Without exception. It's almost annoying.

I submitted a PR to rerun adding docs about this. Maybe the next person finds the warning before they find the bug.

The robot runs now. Hasn't frozen since. The LiDAR streams beautifully.

But I will never call into a library I don't fully understand while holding a mutex again. Fool me once.

Robot on workbench the next morning, working again