Go back

Article

Jun 10, 2026

Aurora - Autonomous Mecanum-Drive Mobile Robo

full-stack autonomous robot from the ground up

Why this project exists

The goal was simple to say and brutal to build: take a 4-wheel mecanum chassis, give it a brain split across an ESP32 microcontroller and a Raspberry Pi 5, and have it autonomously explore an unknown room, build a map, and return home — all running ROS2 Jazzy on real hardware, not in simulation.

What follows is the real story of how that happened: the hardware, the architecture, the two-day debugging sessions, the bugs that looked like one thing and turned out to be another, and the final state machine that makes the robot resilient enough to survive hitting a wall, getting stuck, and being told to go home and come back.

The Hardware

Component	Model / Spec	Role
MCU	Waveshare ESP32-P4-WIFI6 (ESP32-P4 RISC-V main core + ESP32-C6 WiFi co-processor over internal SDIO)	Real-time motor + sensor I/O
Motor Drivers	2× Cytron MDD10A (dual H-bridge, 10A/ch, 6–24V)	Drive 4 motors
Motors	4× PGM45775-19.2K (12V, 19.2:1 gearbox, quadrature encoder, 537.6 counts/output rev)	Wheel actuation
IMU	ISM330DHCX, SPI	Inertial sensing (Madgwick → EKF)
LiDAR	Slamtec C1M1 R2, USB, 12m, 360°, 10Hz	Primary SLAM sensor
Compute	Raspberry Pi 5 8GB, Ubuntu 24.04, ROS2 Jazzy	SLAM, Nav2, exploration, mission control
Dev machine	WSL2 Ubuntu 22.04	Firmware builds, ROS2 CLI, code

The chassis is a classic X-pattern mecanum layout — front-left and rear-right wheels have left-hand rollers, front-right and rear-left have right-hand rollers — which is what enables holonomic motion (strafe, diagonal, rotate-in-place) once the motor direction matrix is correct.

The Architecture: A Split Brain

The robot's "brain" is deliberately split:

ESP32-P4 runs a tight real-time loop: quadrature encoder decoding via PCNT, MCPWM motor control, and an SPI IMU driver. It talks to the Pi over a USB serial link.
Raspberry Pi 5 runs the entire ROS2 Jazzy stack: ros2_control hardware interface, EKF localization (robot_localization), slam_toolbox for mapping, Nav2 for path planning and obstacle avoidance, explore_lite for frontier-based autonomous exploration, and a custom mission-control node (amr_home_manager) that ties it all together.

This split exists because the Pi 5, despite being powerful for a single-board computer, doesn't have hard real-time guarantees — motor PWM and encoder counting need to live on a microcontroller. Everything that benefits from Python, ROS2's ecosystem, and heavier computation (SLAM, planning) lives on the Pi.

Part 1: The Firmware — A Two-Day Fight With a Silent Hang

Before any ROS2 code existed, the first job was just making the wheels spin reliably. This board — the Waveshare ESP32-P4-WIFI6 — turned out to have an undocumented minefield of reserved GPIO pins:

GPIO 9–13 are wired to an onboard audio codec
GPIO 18–23 are suspected to be the internal SDIO bus to the ESP32-C6 WiFi co-processor
GPIO 24–25 are USB D+/D-
GPIO 28–31 sit in the camera/display zone

Early ESP-IDF firmware would silently hang mid-boot — no panic, no backtrace, no watchdog message, just a frozen UART. What followed was a two-day debugging marathon that touched LEDC PWM channel configs, a full rewrite from LEDC to MCPWM, I2C driver "conflicts" that turned out to be a harmless constructor-time log message, and finally a bare-bones app_main() using only esp_rom_printf to prove the board itself was healthy.

The actual root cause? After two days of GPIO theories, a idf.py fullclean and a clean rebuild fixed it outright — the firmware state had become corrupted from rapid flash-test-modify cycles. Not a hardware bug, not a driver bug, just a stale build artifact masquerading as a silicon-level mystery.

Lesson learned: when a board behaves impossibly and every targeted fix fails, try wiping the build before going deeper into hardware theories. The safe GPIO map discovered during this process — PWM = {5, 32, 33, 52}, DIR = {26, 27, 2, 4} — became the permanent, physically-verified pin assignment for all four motors.

With that resolved, an Arduino sketch took over for early teleop (f/b/l/r plus four diagonal commands and strafing), confirmed working over a simple serial link from the Pi, before the full ESP-IDF firmware (PCNT encoders + MCPWM + SPI IMU + binary serial protocol) came online for the ROS2 phases.

Part 2: Getting Onto ROS2 — `ros2_control` and Mecanum Kinematics

With the firmware producing real encoder counts and accepting motor commands, the next step was wiring a ros2_control hardware interface and a mecanum_drive_controller so that a TwistStamped on /cmd_vel would turn into four wheel velocities.

This is where physical measurement mattered more than any config default. The mecanum forward-kinematics math depends on wheel_separation_x and wheel_separation_y — get these wrong and every "forward" command becomes a curve. They were measured directly off the chassis:

wheel_separation_x = 0.462m → lx = 0.231m
wheel_separation_y = 0.510m → ly = 0.255m
lx + ly = 0.486 (the sum_of_robot_center_projection_on_X_Y_axis parameter)

These numbers are now locked into both controllers.yaml and the URDF, and they're the reason the robot drives in a straight line instead of a banana curve.

Part 3: Localization — Where "2× Under-Scaled Odometry" Broke Everything

This is the bug that, in retrospect, explains almost every weird navigation problem the robot had for weeks.

The encoder math looked right on paper:

7 PPR (motor shaft) × 4 (quadrature, X4 mode) × 19.2 (gear ratio) = 537.6 counts/output rev
RAD_PER_COUNT = 2π / 537.6 = 0.01169 rad/count

But the ESP32 firmware's PCNT peripheral was actually counting in X2 (2-edge) mode, not X4. The real figure was 268.8 counts/rev — exactly half. Every wheel-distance calculation the EKF received was therefore halved: push the robot 0.5m, and /odom/wheel reported ~0.25m.

The downstream effects were chaotic and looked like completely unrelated problems:

SLAM fought the LiDAR scans because odometry disagreed with reality
The map smeared and ballooned
explore_lite chased frontiers that didn't exist
A dozen costmap "band-aid" tuning passes were applied to symptoms, none of which addressed the actual cause

The fix was a one-line constant change — RAD_PER_COUNT = 2π / 268.8 — and a hand-push test confirmed 0.514m measured for a 0.5m push. This single fix retroactively explained most of the "nav chaos" from earlier sessions.

The IMU's turn: signal integrity and yaw

Two more localization bugs surfaced once odometry was fixed:

IMU SPI signal integrity. The ISM330DHCX IMU would intermittently return garbage WHO_AM_I values at 500kHz SPI. A direct signal-integrity test showed 18/30 good reads at 500kHz vs. 30/30 at 100kHz — the jumper wires couldn't reliably carry the clock. Dropping to 100kHz fixed it (IMU only needs ~100Hz of data, so bandwidth wasn't the constraint). Later, after soldering the jumpers directly to the PCB, the SPI bus ran cleanly at 1MHz — confirmed with 30/30 reads — and was set to a conservative 500kHz for thermal/vibration headroom.
EKF yaw fight. With both wheel odometry and the IMU feeding yaw into the EKF, the fused heading jittered — visible in SLAM as a radial "star/fan" smear pattern. Mecanum wheels slip on their rollers, so wheel-derived yaw is unreliable; the fix was to disable yaw and angular-velocity fusion from wheel odometry entirely and let the IMU be the sole yaw source — standard practice for any mecanum platform.
Gyro bias drift. Even with the IMU as sole yaw source, a small but persistent ~0.0077 rad/s Z-axis gyro bias caused slow heading drift (initially misdiagnosed as a Madgwick zeta tuning problem — it wasn't, because in 6-DOF mode without a magnetometer, Madgwick is structurally incapable of observing yaw bias). The real fix: a boot-time calibration routine that averages 200 raw gyro samples (~2s) at rest, measures the per-axis bias, and subtracts it from every subsequent reading before it ever reaches Madgwick or the EKF.
TF tree conflict. robot_state_publisher was broadcasting a static base_footprint → base_link chain from the URDF, while the EKF was also configured to publish odom → base_link — giving base_link two parents, which tf2 cannot resolve. Fix: EKF's base_link_frame set to base_footprint, so the EKF publishes odom → base_footprint and the URDF's static chain extends it cleanly from there.

The end state: /odom at a clean 50Hz, fed by a properly-scaled wheel encoder model and an IMU with a calibrated, soldered, signal-clean SPI link feeding a single source of truth for yaw.

Part 4: SLAM — Seeing the Room

With localization solid, slam_toolbox (online async mode, driven through nav2_lifecycle_manager) turned LiDAR scans into a live occupancy grid. One more physical-measurement bug surfaced here too: the LiDAR's TF offset (x = 0.327m, the center of the scan disc relative to base_link) had to be measured directly off the mounted hardware — and in a later session, the LiDAR itself was discovered to be mounted 180° backwards, requiring a URDF rpy correction. Once both were right, RViz2 showed a coherent room map with clean wall lines.

Part 5: Nav2 — Teaching a Holonomic Robot to Plan

Nav2 brought its own holonomic-specific challenges:

SmacPlannerLattice needs precomputed motion-primitive files matched to the robot's motion model. The official omni (holonomic) primitive set — 144 trajectories across 16 headings at 5cm resolution — was the right match for mecanum and didn't need regeneration.
MPPI controller was configured in Omni motion mode, the only mode that lets it actually use the mecanum drive's strafing capability.
A missing inflation_layer on the global costmap caused planner_server to spam warnings on every single planning cycle and degraded path quality — an easy miss since the costmap "worked" without it, just badly.
Initially, the planner went lattice-based path failures because the controller had been switched to a differential-drive assumption while the global planner still assumed holonomic motion — geometrically valid paths that the controller couldn't track. Switching to SmacPlanner2D (no kinematic assumptions, plain 8-connected grid search) fixed it.

The Pi 5's CPU budget

MPPI is computationally expensive, and the Pi 5 has a real budget: a 20Hz control loop means 50ms per cycle. The initial MPPI config (batch_size=2000, time_steps=56) cost 62–125ms per cycle — the control loop was running at 8–16Hz, and the robot lurched and jerked because of it (independent of any localization issue).

The fix was a deliberate trade: batch_size 2000→1000, time_steps 56→40, model_dt 0.05→0.07. New cost: 1000×40 = 40,000 sample-steps, about 0.36× the old load — comfortably inside the 20Hz budget — while keeping the planning horizon at the same 2.8 seconds (40 × 0.07s). Per-step displacement (0.10 m/s × 0.07s = 0.007m) stayed well under the 0.05m costmap cell size, so no collision-check resolution was lost.

A second round of tuning (vx_max/vy_max 0.18→0.10 m/s, wz_max 0.7→0.4 rad/s, inflation radius 0.36→0.45m) slowed the robot further for safer supervised testing and earlier wall-avoidance curving.

Part 6: Autonomous Exploration — The Premature-Stop Saga

This is where the project spent the most debugging time, because the symptom — "the robot explores a little bit and then just... stops" — had three completely different root causes across multiple sessions, each one masking the next.

Root cause 1: the wrong global planner (fixed earlier, in Part 5)

Root cause 2: a costmap QoS race that explore_lite never recovered from

nav2_costmap_2d's Costmap2DPublisher only sends a full occupancy grid on /global_costmap/costmap once, on activation — after that, only incremental /global_costmap/costmap_updates messages go out. That one-shot full publish uses TRANSIENT_LOCAL QoS, but explore_lite's costmap subscriber uses the rclpy/rclcpp default (VOLATILE) QoS. In the staggered launch sequence, global_costmap activates around T+19s but explore_lite doesn't start until T+23s — by the time its VOLATILE subscription joins, the one full-grid message is long gone, and TRANSIENT_LOCAL replay doesn't help VOLATILE subscribers.

Result: explore_lite sat forever on "Waiting for costmap to become available," the robot never moved, and every prior "autonomous explore" run had silently been doing nothing.

The fix was one parameter: always_send_full_costmap: true on the global costmap, forcing a full-grid publish every cycle (1Hz) regardless of when a subscriber joins.

Root cause 3: the state machine was never actually "exploring"

Even after the costmap fix got the robot moving, the home_manager's recovery logic — the stall watchdog, the recovery-spin handler — never fired. Why? explore_lite starts itself autonomously the moment it launches; it doesn't wait for a command. But explore_map.launch.py never published /amr/command "explore", so home_manager's internal state stayed IDLE forever. All of its State.EXPLORING-gated logic — including everything meant to recover from "no frontiers found" — had been completely inert this whole time, on every previous run.

The fix: home_manager now listens to /explore/status itself, and transitions IDLE → EXPLORING when it sees EXPLORATION_STARTED or EXPLORATION_IN_PROGRESS — mirroring explore_lite's actual autonomous behavior instead of waiting for a command that was never going to come.

The payoff

With all three fixes in place, a full run completed: the robot drove through three preempted goals, the costmap grew from 203×217 to 391×315 cells (~19.5m × 15.75m at 5cm/pixel) — real room-scale exploration, not the old ~1.5m stop-after-one-frontier behavior. A recovery spin (a full 360° Nav2 Spin action to sweep the LiDAR across new territory) fired correctly when the first frontier search came up empty, and the map saved successfully.

Part 7: Making Exploration Resilient — Retry, STUCK, and Coming Home

The recovery-spin design worked, but it had a hard ceiling: after 2 spins with no new frontiers, the robot gave up permanently and called it "fully explored" — even if most of the room was still unmapped. Live testing also surfaced a sharper problem: when the robot drove into a wall, explore_lite's synchronous makePlan() inside resume() would immediately re-evaluate the same (unchanged) costmap, immediately report "no frontiers" again, and the home_manager would immediately re-nudge it — producing a tight ~20ms busy-loop, spamming logs and burning CPU on a Pi that has a documented brownout history under load.

The redesign that followed replaced the whole "give up after N spins" model with something closer to how a person would actually operate the robot:

Persistent retry, not termination. On "no frontiers found," home_manager never gives up on its own anymore — it just re-publishes /explore/resume=True and keeps trying. A 2-second debounce timer breaks the busy-loop: instead of re-nudging instantly (which just re-triggers the same synchronous "no frontiers" result), it waits 2 seconds before retrying. Only an explicit operator command ends exploration now.
A new STUCK state. The existing stall watchdog (15s of no /odom motion → nudge /explore/resume) now escalates: after 15 nudges with still no motion, the robot transitions to STUCK and logs a prompt for the operator to intervene with stop or go_home.
Continuous path recording. While EXPLORING or STUCK, the robot samples its (x, y, yaw) from /odom every 0.5m into an in-memory path, seeded with its home pose the moment exploration starts.
A real operator vocabulary. /amr/command now accepts:
- stop — save the map (via slam_toolbox/save_map) and the recorded path (as JSON) to disk, then go IDLE.
- go_home — save progress, then retrace the recorded path in reverse via sequential NavigateToPose waypoints, returning to the starting pose.
- resume — pause exploration, retrace the path forward from home back to the saved breakpoint, then resume exploring from exactly where it left off.
Reentrancy guards. go_home/resume no-op safely if already mid-retrace; stop during a retrace logs that a retrace is in progress instead of a misleading "already idle"; and arriving back at a breakpoint via resume resets all the stall-watchdog bookkeeping so a stale STUCK escalation can't desync the state machine after a round trip.

This was built test-first — 58/58 unit tests passing against a mocked rclpy/ROS2 environment — then validated live on the Pi, where it immediately caught one more real bug: a launch file was passing the literal string ~/AMR/maps/explore_map (with an un-expanded ~) as the map-save path parameter, which open() doesn't expand. A small _resolve_map_save_path() helper that calls os.path.expanduser() before any file I/O fixed it.

What This Project Actually Teaches

A few patterns showed up again and again, across firmware and ROS2 alike:

Measure the physical robot, don't trust the placeholder. Wheel separation, LiDAR offset, encoder counts-per-rev, GPIO safety zones — every one of these started as a documented "best guess" and had to be replaced with a number taken off the real hardware before anything worked correctly.
One root cause can wear many masks. The 2× odometry scaling bug produced SLAM smearing, explore_lite chasing phantom frontiers, and a dozen costmap tuning passes that treated symptoms. The premature-stop bug had three independent root causes stacked on top of each other, each one hiding the next until the previous was fixed.
"It looks like it's working" isn't validation. Recovery-spin logic, a stall watchdog, and an entire state machine all sat completely inert for sessions because _state never left IDLE — the robot looked like it was exploring autonomously (because explore_lite genuinely was), but none of the home_manager's safety nets were actually armed.
Resource-constrained compute changes the right answer. MPPI's "default" parameters were correct for a desktop; on a Pi 5 they blew the control-loop budget by 2-3×. The fix wasn't to abandon MPPI, it was to find the parameter combination that preserved the planning horizon and collision-check resolution while fitting the real CPU budget.
When a build behaves impossibly, suspect the build before the silicon. Two days of GPIO-conflict theories were ultimately moot — a clean rebuild fixed a "hardware bug" that was actually stale build state.

Current Status

As of now, the full pipeline works end-to-end on real hardware:

✅ ESP32 firmware: encoders, MCPWM motor control, SPI IMU, binary serial protocol
✅ ros2_control mecanum drive, calibrated kinematics
✅ EKF localization at 50Hz, calibrated gyro, soldered SPI IMU at 500kHz
✅ slam_toolbox live mapping with a coherent, non-smeared map
✅ Nav2 (SmacPlanner2D + MPPI Omni) tuned to fit the Pi 5's real-time budget
✅ Autonomous frontier exploration that doesn't give up prematurely
✅ A resilient mission-control state machine: persistent retry, STUCK escalation, path recording, and go_home/resume retrace — 58/58 tests passing, validated live

What's left is mostly polish: confirming slam_toolbox/save_map is robust under load, running a true open-space exploration pass to assess map quality (wall straightness, loop closure), and the long-standing item of soldering the remaining sensor jumpers for good measure.

The throughline across all of it: a robot that explores a room on its own, knows when it's stuck, and can be told to come home and pick up where it left off — built from a microcontroller that wouldn't boot, an encoder that was off by exactly 2×, and a CPU budget that had to be respected one parameter at a time.