KV-Aware Routing (Enterprise)
KV-Aware Routing is a specialized load-balancing strategy designed for large-scale RAG (Retrieval-Augmented Generation) and multi-turn conversations. It minimizes Time-To-First-Token (TTFT) by routing each request to a vLLM node that already holds the request's prefix in its GPU KV-cache.
The Problem: Prefill Latency
Under standard round-robin load balancing, a request with a 10k-token context may land on a "cold" node. This forces the GPU to recompute the Key-Value (KV) cache for the entire prompt, adding seconds of latency before the first token.
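A rough back-of-the-envelope calculation makes the cost concrete. The prefill throughput below is an illustrative assumption, not a Bauxite or vLLM benchmark; real numbers vary widely by model, batch size, and hardware:

```go
package main

import "fmt"

func main() {
	// Assumed prefill throughput for a mid-size model on one GPU (illustrative).
	const prefillTokensPerSec = 5000.0
	const promptTokens = 10000.0

	// A cold node must prefill the whole prompt before emitting a single token.
	coldTTFT := promptTokens / prefillTokensPerSec
	fmt.Printf("cold-node prefill adds ~%.1fs before the first token\n", coldTTFT)
}
```

A warm node that already caches the prefix skips this work entirely, which is the latency KV-Aware Routing recovers.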
The Solution: Global Radix Tree
Bauxite Enterprise maintains a real-time map of prefix hashes across your entire inference fleet. The router sends each request to the node holding the deepest matching prefix; when no node matches, it falls back to a standard strategy.
Zero-Trust Compatibility: Volatile-Only Caching
While standard KV-caching solutions often persist data to local SSDs or external databases, Bauxite enforces a Volatile-Only policy to maintain the “20MB Straitjacket” security model.
RAM-Only Radix Tree: Prefix maps are stored strictly in memory. If the memory ceiling is reached, Bauxite evicts the oldest hashes rather than spilling to disk.
Scrubbed Cache: The PII Janitor runs before the KV-router. Only redacted prompt signatures are ever cached, ensuring sensitive data like API keys or PII never reside in the KV-blocks.
Disk-Lockout: Even when integrated with external caching layers, Bauxite’s internal logic is structurally incapable of calling os.WriteFile.
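The first two guarantees compose: redaction runs before hashing, and the prefix map evicts rather than spills when it hits its ceiling. A minimal sketch, assuming a toy redaction regex and a simple insertion-order eviction policy (neither is Bauxite's actual implementation):

```go
package main

import (
	"container/list"
	"crypto/sha256"
	"fmt"
	"regexp"
)

// scrub stands in for the PII Janitor. The regex is a toy placeholder for
// illustration only; the point is that it runs BEFORE hashing, so only
// redacted signatures ever enter the map.
var keyPattern = regexp.MustCompile(`sk-[A-Za-z0-9]+`)

func scrub(prompt string) string {
	return keyPattern.ReplaceAllString(prompt, "[REDACTED]")
}

// prefixMap is a bounded, RAM-only index: at the ceiling it evicts the
// oldest entry instead of spilling to disk.
type prefixMap struct {
	limit   int
	order   *list.List // insertion order, oldest at front
	entries map[[32]byte]*list.Element
}

func newPrefixMap(limit int) *prefixMap {
	return &prefixMap{limit: limit, order: list.New(), entries: map[[32]byte]*list.Element{}}
}

func (m *prefixMap) add(prompt string) {
	sig := sha256.Sum256([]byte(scrub(prompt))) // scrub first, then hash
	if _, ok := m.entries[sig]; ok {
		return
	}
	if m.order.Len() >= m.limit { // ceiling reached: evict oldest, never touch disk
		oldest := m.order.Front()
		delete(m.entries, oldest.Value.([32]byte))
		m.order.Remove(oldest)
	}
	m.entries[sig] = m.order.PushBack(sig)
}

func main() {
	m := newPrefixMap(2)
	m.add("first prompt")
	m.add("my key is sk-abc123") // stored only as a redacted signature
	m.add("third prompt")        // evicts "first prompt"
	fmt.Println(m.order.Len())   // stays at the ceiling: 2
}
```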
Configuration
Enable KV-Aware routing by defining your vLLM cluster in the enterprise config block:
```yaml
enterprise:
  routing:
    strategy: kv_aware
    radix_depth: 3 # Number of blocks to track
    fallback: least_busy
  clusters:
    - name: vllm-prod
      endpoints: ["http://node-a:8000", "http://node-b:8000"]
```

LMCache Integration
Bauxite Enterprise works natively with LMCache. When a request arrives, Bauxite can pre-emptively signal LMCache to swap the required KV-blocks from CPU/SSD to GPU memory before the HTTP request even hits the inference engine.
Credits
Bauxite’s routing engine leverages prefix-matching logic from the LMCache project. We extend these patterns to work within a memory-safe Go runtime optimized for ephemeral, zero-disk environments.