# OTA Updates

> **Location**: Enterprise crate `bauxite-forge/src/ota/`.

Bauxite Forge provides an Over-The-Air update client that checks for updates via gRPC, downloads them in resumable chunks, verifies integrity before restart, and provides rollback, warm-start, and heartbeat monitoring capabilities.

## Update Cycle

`OtaManager` runs a background loop (300s poll interval) that:

1. Calls `client.check_for_updates()` via gRPC to check if an update is available.
2. If available, calls `apply_update()` which orchestrates download, verification, yield, and atomic binary swap.

The `OtaProvider` trait (`traits.rs`) defines `start_manager()` and `as_any()` for downcasting, and `OtaEvent::YieldRequested` signals the agent to detach eBPF hooks and pin maps before binary swap.

## Resume-Safe Download

The download is checkpointed to `checkpoint.json` (serialized via `serde_json`) so that if the connection drops, the next cycle resumes from the last verified offset.

`DownloadCheckpoint` struct tracks:

- `target_version`: the version being downloaded
- `bytes_downloaded`: current byte offset
- `total_size`: expected package size
- `checksum_sha256`: expected SHA-256 hash
- `signature`: Ed25519 signature from the Hub

If a checkpoint exists for a *different* target version, it is discarded and the download restarts from zero.

## Integrity Verification

After the full binary is downloaded, two checks are performed (`manager.rs`):

1. **SHA-256 hash**: the downloaded binary is hashed and compared against `checksum_sha256` from the update manifest.
2. **Ed25519 signature**: the Hub's public key (`config.hub.public_key`) is decoded to a `VerifyingKey` and verifies the binary data against the attached signature.

Only if both checks pass does the manager proceed to yield.

## Yield and Atomic Binary Swap

Upon successful verification, `OtaManager` sends `OtaEvent::YieldRequested` to the agent and waits for acknowledgment.
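
The request/acknowledge handshake can be sketched as follows. This is a minimal illustration, not the crate's actual implementation: the `ack` payload on `YieldRequested`, the helper names `request_yield` and `spawn_agent`, the timeout, and the use of `std::sync::mpsc` channels are all assumptions made for this example.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Event sent from the OTA manager to the agent. The variant name follows the
/// text above; carrying an `ack` sender inside it is an assumption.
enum OtaEvent {
    YieldRequested { ack: Sender<()> },
}

/// Sends a yield request and blocks until the agent acknowledges or the
/// timeout elapses. Returns `true` only when an acknowledgment arrived.
fn request_yield(events: &Sender<OtaEvent>, timeout: Duration) -> bool {
    let (ack_tx, ack_rx) = channel();
    if events.send(OtaEvent::YieldRequested { ack: ack_tx }).is_err() {
        return false; // the agent side of the channel is gone
    }
    match ack_rx.recv_timeout(timeout) {
        Ok(()) => true,
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => false,
    }
}

/// Hypothetical agent loop: acknowledges each yield request after cleanup.
fn spawn_agent() -> Sender<OtaEvent> {
    let (tx, rx) = channel::<OtaEvent>();
    thread::spawn(move || {
        for event in rx {
            match event {
                OtaEvent::YieldRequested { ack } => {
                    // Placeholder: detach eBPF hooks and pin maps here.
                    let _ = ack.send(());
                }
            }
        }
    });
    tx
}
```

In this sketch the manager treats a missing or late acknowledgment as a failed yield, so the binary swap is never attempted while the agent may still hold live hooks.
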
The agent is expected to cleanly detach eBPF hooks and pin maps before acknowledging.

After yield:

- `OtaManager` renames `update.tmp` to a staging binary path (`update_staging`) and sets executable permissions (`0o755` on Unix).
- `OtaManager` calls `rollback.commit_upgrade()`, which removes the backup binary (the upgrade was successful).
- `OtaManager` invokes `restart::replace_self_with_new_binary(&staging_path)`, which sets executable permissions and calls `std::process::Command` with `exec()` to replace the running process with the new binary, preserving CLI arguments.

## Rollback (Production)

> **Status**: Fully implemented and tested.

`RollbackManager` (`rollback.rs`) provides three methods:

- `prepare_upgrade()`: reads the currently running binary via `std::env::current_exe()` and copies it to `rollback.bin`.
- `commit_upgrade()`: removes the backup file after a successful upgrade.
- `trigger_revert()`: renames `rollback.bin` to `rollback_current.bin` and calls `exec()` into it.

**Infinite loop prevention**: A `rollback.marker` file is written before exec. If the new process detects the marker, it refuses to roll back again (returning a `Consecutive rollback detected` error) until the marker is cleared by a successful heartbeat.

## Heartbeat Watchdog (Production)

> **Status**: Fully implemented and tested.

`OtaManager` spawns a `heartbeat_watchdog` background task that runs every 60 seconds. The watchdog reads the `last_checkin` timestamp, shared with the agent's telemetry loop via `HeartbeatCheckin` (`telemetry.rs`).

**Telemetry integration**: The telemetry loop (`run_telemetry_loop`) records a successful Hub report by calling `heartbeat.record()` after every successful `client.report_metrics()` call. The OTA manager shares this same `Arc<parking_lot::Mutex<Instant>>` so it can monitor Hub connectivity.
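
A shared check-in handle of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the contents of `telemetry.rs`: the struct layout and the `since_last()` helper are hypothetical, and `std::sync::Mutex` is swapped in for `parking_lot::Mutex` so the example needs no external crates.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Shared check-in handle. The text above describes an
/// `Arc<parking_lot::Mutex<Instant>>`; `std::sync::Mutex` is used here only
/// to keep the sketch dependency-free.
#[derive(Clone)]
struct HeartbeatCheckin {
    last_checkin: Arc<Mutex<Instant>>,
}

impl HeartbeatCheckin {
    /// Starts the clock at construction time.
    fn new() -> Self {
        Self {
            last_checkin: Arc::new(Mutex::new(Instant::now())),
        }
    }

    /// Called by the telemetry loop after a successful Hub report.
    fn record(&self) {
        *self.last_checkin.lock().unwrap() = Instant::now();
    }

    /// Time elapsed since the last recorded check-in (hypothetical helper
    /// that a watchdog task could poll).
    fn since_last(&self) -> Duration {
        self.last_checkin.lock().unwrap().elapsed()
    }
}
```

Cloning the handle clones only the `Arc`, so the telemetry loop and the watchdog observe the same timestamp.
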

**Auto-rollback trigger**: If no heartbeat is recorded within `rollback_timeout` (default 300s, configurable via `with_rollback_timeout`), the watchdog calls `rollback.trigger_revert()` and terminates the process.

## Warm Start Snapshot (Production)

> **Status**: Fully implemented and tested.

`StateSnapshot` (`state.rs`) is a postcard-serializable structure containing peer routing data (node ID, virtual IP, public key, last known ICE endpoint). Roundtrip serialization/deserialization is tested.

**Capture flow**: `BauxiteNode::get_snapshot_state()` iterates over all peers in `_peer_registry`, collects their virtual IPs and public keys, and serializes the `StateSnapshot` to a postcard blob.

**Restore flow**: `BauxiteNode::inject_snapshot_state()` deserializes the blob, iterates over `PeerSnapshot` entries, and calls `upsert_peer()` to restore routing table entries. Session keys are **never restored** to maintain Perfect Forward Secrecy (PFS).

**Bootstrap integration**: `build_forge_agent()` in `bauxite-forge/src/lib.rs` attempts to pull the warm-start blob from the Hub during bootstrap (with a 5s timeout). The blob is merged into `BootstrapResult::initial_state` and injected during `BauxiteNode::new()`. Timeouts and failures are handled gracefully: the node starts with a cold peer table if the Hub is unreachable.

## Post-Restart Behavior

After `replace_self_with_new_binary()` calls `exec()`, the new process starts fresh. The CLI arguments are preserved so the same command line is used. The `restart.rs` module handles:

- **Process replacement**: `replace_self_with_new_binary()` sets executable permissions and calls `exec()` to replace the running process with the new binary, preserving CLI arguments.
- **Exec failure**: if `exec()` fails, the error is logged and the function returns; the caller should handle the failure case.
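
The exec-with-preserved-arguments pattern can be sketched as follows. This is a Unix-only illustration, not the actual body of `replace_self_with_new_binary()`: the function names `build_restart_command` and `replace_self` are hypothetical, and the permission-setting step is omitted for brevity.

```rust
use std::env;
use std::ffi::OsString;
use std::io;
use std::os::unix::process::CommandExt; // provides `exec()` on Unix
use std::path::Path;
use std::process::Command;

/// Builds the command for the new binary, forwarding the given arguments.
fn build_restart_command(
    new_binary: &Path,
    args: impl IntoIterator<Item = OsString>,
) -> Command {
    let mut cmd = Command::new(new_binary);
    cmd.args(args);
    cmd
}

/// Replaces the running process with `new_binary`, keeping the current CLI
/// arguments. `exec()` only returns if it failed, so the return value is
/// always the error that the caller must handle.
fn replace_self(new_binary: &Path) -> io::Error {
    // Skip argv[0]: it is the path of the old binary.
    let args = env::args_os().skip(1);
    build_restart_command(new_binary, args).exec()
}
```

Because a successful `exec()` never returns, any code that runs after `replace_self()` is by definition on the failure path, which is why the error is returned directly rather than wrapped in a `Result`.
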