# OTA Updates

> **Location**: Enterprise crate `bauxite-forge/src/ota/`.

Bauxite Forge includes an Over-The-Air (OTA) update client that checks for updates via gRPC, downloads them in resumable chunks, and verifies integrity before restart. It also provides rollback, warm-start, and heartbeat monitoring capabilities.

## Update Cycle

`OtaManager` runs a background loop (300s poll interval) that:

1. Calls `client.check_for_updates()` via gRPC to check if an update is available.
2. If available, calls `apply_update()` which orchestrates download, verification, yield, and atomic binary swap.
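
The two steps above can be sketched as a small polling loop. This is an illustration only: the `UpdateClient` and `Updater` traits, their signatures, and the synchronous `sleep` callback are hypothetical stand-ins, and the real `OtaManager` talks to the Hub over gRPC.

```rust
use std::time::Duration;

/// Poll interval described above (300 s).
const POLL_INTERVAL: Duration = Duration::from_secs(300);

/// Hypothetical stand-in for the gRPC client; the real signature may differ.
trait UpdateClient {
    /// Returns the available version, if any.
    fn check_for_updates(&mut self) -> Result<Option<String>, String>;
}

/// Hypothetical stand-in for the download/verify/yield/swap pipeline.
trait Updater {
    fn apply_update(&mut self, version: &str) -> Result<(), String>;
}

/// One iteration of the cycle: check, then apply if something is available.
/// Returns the version that was applied, if any.
fn run_cycle<C: UpdateClient, U: Updater>(
    client: &mut C,
    updater: &mut U,
) -> Result<Option<String>, String> {
    match client.check_for_updates()? {
        Some(version) => {
            updater.apply_update(&version)?;
            Ok(Some(version))
        }
        None => Ok(None),
    }
}

/// Background loop: a failed cycle is logged and retried on the next tick.
fn run_loop<C: UpdateClient, U: Updater>(
    mut client: C,
    mut updater: U,
    sleep: impl Fn(Duration),
) -> ! {
    loop {
        if let Err(e) = run_cycle(&mut client, &mut updater) {
            eprintln!("OTA cycle failed: {e}");
        }
        sleep(POLL_INTERVAL);
    }
}
```

In a blocking setup, `std::thread::sleep` can be passed as the `sleep` argument; an async runtime would use its own timer instead.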

The `OtaProvider` trait (`traits.rs`) defines `start_manager()` and `as_any()` (for downcasting). `OtaEvent::YieldRequested` signals the agent to detach eBPF hooks and pin maps before the binary swap.

## Resume-Safe Download

The download is checkpointed to `checkpoint.json` (serialized via `serde_json`) so that if the connection drops, the next cycle resumes from the last verified offset.

The `DownloadCheckpoint` struct tracks:
- `target_version` — the version being downloaded
- `bytes_downloaded` — current byte offset
- `total_size` — expected package size
- `checksum_sha256` — expected SHA-256 hash
- `signature` — Ed25519 signature from the Hub

If a checkpoint exists for a *different* target version, it is discarded and the download restarts from zero.
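
The resume decision can be sketched as follows. The field names come from the list above, but the field types, the `resume_offset` helper, and the clamp to `total_size` are assumptions made for illustration. The real struct is serialized with `serde_json`, which is omitted here to keep the sketch dependency-free.

```rust
/// Fields mirror the list above; the types are assumptions.
/// The real struct would also derive serde's Serialize/Deserialize.
#[derive(Debug, Clone, PartialEq)]
struct DownloadCheckpoint {
    target_version: String,
    bytes_downloaded: u64,
    total_size: u64,
    checksum_sha256: String,
    signature: String,
}

/// Hypothetical helper: decide the byte offset at which to resume.
/// A missing checkpoint, or one for a different version, restarts from zero.
fn resume_offset(checkpoint: Option<&DownloadCheckpoint>, wanted_version: &str) -> u64 {
    match checkpoint {
        Some(cp) if cp.target_version == wanted_version => {
            // Defensive clamp (an assumption): never resume past the end.
            cp.bytes_downloaded.min(cp.total_size)
        }
        _ => 0,
    }
}
```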

## Integrity Verification

After the full binary is downloaded, two checks are performed (`manager.rs`):

1. **SHA-256 hash** — the downloaded binary is hashed and compared against `checksum_sha256` from the update manifest.
2. **Ed25519 signature** — the Hub's public key (`config.hub.public_key`) is decoded to a `VerifyingKey` and verifies the binary data against the attached signature.

Only if both checks pass does the manager proceed to yield.

## Yield and Atomic Binary Swap

Upon successful verification, `OtaManager` sends `OtaEvent::YieldRequested` to the agent and waits for acknowledgment. The agent is expected to cleanly detach eBPF hooks and pin maps before acknowledging.

After yield:
- `OtaManager` renames `update.tmp` to a staging binary path (`update_staging`) and sets executable permissions (`0o755` on Unix).
- `OtaManager` calls `rollback.commit_upgrade()` which removes the backup binary (the upgrade was successful).
- `OtaManager` invokes `restart::replace_self_with_new_binary(&staging_path)`, which sets executable permissions and replaces the running process with the new binary by calling `exec()` on a `std::process::Command` (Unix `CommandExt::exec`), preserving CLI arguments.
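
The staging step in the first bullet might look roughly like this. The file names come from the text above; the `stage_update` helper and the assumption that both files live in one directory are illustrative only.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: move the verified download into the staging path
/// and mark it executable. Assumes both files live in `dir`.
fn stage_update(dir: &Path) -> io::Result<PathBuf> {
    let tmp = dir.join("update.tmp");
    let staging = dir.join("update_staging");

    // On POSIX systems a rename within one filesystem is atomic, so readers
    // see either the old state or the fully staged binary.
    fs::rename(&tmp, &staging)?;

    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        // rwxr-xr-x, matching the 0o755 mode mentioned above.
        fs::set_permissions(&staging, fs::Permissions::from_mode(0o755))?;
    }

    Ok(staging)
}
```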

## Rollback (Production)

> **Status**: Fully implemented and tested.

`RollbackManager` (`rollback.rs`) provides three methods:
- `prepare_upgrade()` — reads the currently running binary via `std::env::current_exe()` and copies it to `rollback.bin`.
- `commit_upgrade()` — removes the backup file after a successful upgrade.
- `trigger_revert()` — renames `rollback.bin` to `rollback_current.bin` and calls `exec()` into it.

**Infinite loop prevention**: A `rollback.marker` file is written before exec. If the new process detects the marker, it refuses to roll back again (returning a `Consecutive rollback detected` error) until the marker is cleared by a successful heartbeat.
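
A minimal sketch of the marker guard, assuming the marker lives in a known directory. The `begin_revert` and `clear_marker` helpers are hypothetical names; only the `rollback.marker` file name and the error message come from the text above.

```rust
use std::fs;
use std::io;
use std::path::Path;

const MARKER: &str = "rollback.marker";

/// Hypothetical guard run at the start of a revert: refuse a second
/// rollback while the marker from the previous one still exists.
fn begin_revert(dir: &Path) -> io::Result<()> {
    let marker = dir.join(MARKER);
    if marker.exists() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            "Consecutive rollback detected",
        ));
    }
    // Written before exec so the next process can see a rollback happened.
    fs::write(&marker, b"")?;
    Ok(())
}

/// Hypothetical hook called after a successful heartbeat: rollbacks are
/// allowed again. A missing marker is not an error.
fn clear_marker(dir: &Path) -> io::Result<()> {
    match fs::remove_file(dir.join(MARKER)) {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}
```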

## Heartbeat Watchdog (Production)

> **Status**: Fully implemented and tested.

`OtaManager` spawns a `heartbeat_watchdog` background task that runs every 60 seconds. The watchdog reads the `last_checkin` timestamp, shared with the agent's telemetry loop via `HeartbeatCheckin` (`telemetry.rs`).

**Telemetry integration**: The telemetry loop (`run_telemetry_loop`) records a successful Hub report by calling `heartbeat.record()` after every successful `client.report_metrics()` call. The OTA manager shares this same `Arc<parking_lot::Mutex<Instant>>` so it can monitor Hub connectivity.

**Auto-rollback trigger**: If no heartbeat is recorded within `rollback_timeout` (default 300s, configurable via `with_rollback_timeout`), the watchdog calls `rollback.trigger_revert()` and terminates the process.
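
The shared timestamp and the timeout check can be sketched as below. Two deliberate deviations from the description above: `std::sync::Mutex` is swapped in for `parking_lot::Mutex` to avoid a dependency, and `record` and `is_expired` take an explicit `now` argument (an assumption) so the logic can be exercised without waiting.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Shared check-in timestamp. The real type wraps a parking_lot::Mutex;
/// std::sync::Mutex is used here to keep the sketch dependency-free.
#[derive(Clone)]
struct HeartbeatCheckin {
    last_checkin: Arc<Mutex<Instant>>,
}

impl HeartbeatCheckin {
    fn new(now: Instant) -> Self {
        Self { last_checkin: Arc::new(Mutex::new(now)) }
    }

    /// Called by the telemetry loop after a successful Hub report.
    fn record(&self, now: Instant) {
        *self.last_checkin.lock().unwrap() = now;
    }

    /// Watchdog check: has more than `rollback_timeout` elapsed since the
    /// last recorded check-in?
    fn is_expired(&self, now: Instant, rollback_timeout: Duration) -> bool {
        let last = *self.last_checkin.lock().unwrap();
        now.saturating_duration_since(last) > rollback_timeout
    }
}
```

Because the handle is `Clone` and wraps an `Arc`, the telemetry loop and the watchdog can each hold a copy that points at the same timestamp.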

## Warm Start Snapshot (Production)

> **Status**: Fully implemented and tested.

`StateSnapshot` (`state.rs`) is a postcard-serializable structure containing peer routing data (node ID, virtual IP, public key, last known ICE endpoint). Roundtrip serialization/deserialization is tested.

**Capture flow**: `BauxiteNode::get_snapshot_state()` iterates over all peers in `_peer_registry`, collects their virtual IPs and public keys, and serializes the `StateSnapshot` to a postcard blob.

**Restore flow**: `BauxiteNode::inject_snapshot_state()` deserializes the blob, iterates over `PeerSnapshot` entries, and calls `upsert_peer()` to restore routing table entries. Session keys are **never restored** to maintain Perfect Forward Secrecy (PFS).
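
The restore flow can be illustrated with plain structs. Everything below is a sketch under assumptions: the field types, the `RoutingTable` and `PeerEntry` types, and the `session_key` field are hypothetical, and postcard (de)serialization is left out so the example has no dependencies. The point it shows is that restored entries carry routing data only and never a session key.

```rust
use std::collections::HashMap;
use std::net::{Ipv4Addr, SocketAddr};

/// Per-peer routing data, as listed above. Types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct PeerSnapshot {
    node_id: String,
    virtual_ip: Ipv4Addr,
    public_key: [u8; 32],
    last_endpoint: Option<SocketAddr>,
}

#[derive(Debug, Clone, PartialEq, Default)]
struct StateSnapshot {
    peers: Vec<PeerSnapshot>,
}

/// Hypothetical live routing entry; `session_key` is not in the snapshot.
#[derive(Debug)]
struct PeerEntry {
    virtual_ip: Ipv4Addr,
    public_key: [u8; 32],
    last_endpoint: Option<SocketAddr>,
    session_key: Option<[u8; 32]>,
}

#[derive(Default)]
struct RoutingTable {
    peers: HashMap<String, PeerEntry>,
}

impl RoutingTable {
    /// Insert or refresh routing data. A new entry starts without a session
    /// key, so a fresh handshake is required after a restore.
    fn upsert_peer(&mut self, p: &PeerSnapshot) {
        let entry = self.peers.entry(p.node_id.clone()).or_insert_with(|| PeerEntry {
            virtual_ip: p.virtual_ip,
            public_key: p.public_key,
            last_endpoint: p.last_endpoint,
            session_key: None,
        });
        entry.virtual_ip = p.virtual_ip;
        entry.public_key = p.public_key;
        entry.last_endpoint = p.last_endpoint;
    }

    /// Restore every peer from an already-deserialized snapshot.
    fn inject_snapshot_state(&mut self, snapshot: &StateSnapshot) {
        for peer in &snapshot.peers {
            self.upsert_peer(peer);
        }
    }
}
```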

**Bootstrap integration**: `build_forge_agent()` in `bauxite-forge/src/lib.rs` attempts to pull the warm-start blob from the Hub during bootstrap (with a 5s timeout). The blob is merged into `BootstrapResult::initial_state` and injected during `BauxiteNode::new()`. Timeouts and failures are handled gracefully: the node starts with a cold peer table if the Hub is unreachable.

## Post-Restart Behavior

After `replace_self_with_new_binary()` calls `exec()`, the new process starts fresh with the same command line, because the CLI arguments are preserved. The `restart.rs` module handles this:
- `replace_self_with_new_binary()` sets executable permissions and calls `exec()` to replace the running process with the new binary, preserving CLI arguments.
- If `exec()` fails, the error is logged and the function returns; the caller should handle the failure case.
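
A rough sketch of the exec step on Unix, showing how the arguments are forwarded and why a return value means failure. The helper names `build_restart_command` and `replace_self` are hypothetical and are not the actual contents of `restart.rs`.

```rust
use std::ffi::OsString;
use std::path::Path;
use std::process::Command;

/// Build the command for the new binary, forwarding the current CLI
/// arguments. The first argument (the old program path) is skipped.
fn build_restart_command(
    new_binary: &Path,
    current_args: impl IntoIterator<Item = OsString>,
) -> Command {
    let mut cmd = Command::new(new_binary);
    cmd.args(current_args.into_iter().skip(1));
    cmd
}

/// Replace the running process with the new binary. On success this never
/// returns; the returned value is the exec error the caller must handle.
#[cfg(unix)]
fn replace_self(new_binary: &Path) -> std::io::Error {
    use std::os::unix::process::CommandExt;
    build_restart_command(new_binary, std::env::args_os()).exec()
}
```

`CommandExt::exec` returns an `io::Error` rather than a `Result` because a successful call does not return at all.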