KV-Aware Routing (Enterprise)
KV-Aware Routing is a specialized load-balancing strategy designed for large-scale RAG (Retrieval-Augmented Generation) and multi-turn conversations. It minimizes Time-To-First-Token (TTFT) by routing each request to a vLLM node that already holds the request's prefix in its GPU KV-cache.
The Problem: Prefill Latency
Under standard round-robin load balancing, a request with a 10k-token context may land on a "cold" node. This forces the GPU to recompute the Key-Value (KV) cache for the entire prompt, adding seconds of latency before the first token.
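A rough back-of-the-envelope calculation makes the cost concrete. The prefill throughput below is an illustrative assumption, not a Bauxite or vLLM benchmark; real numbers vary widely by model, batch size, and hardware:

```go
package main

import "fmt"

func main() {
	// Assumed prefill throughput for a mid-size model on one GPU (illustrative).
	const prefillTokensPerSec = 5000.0
	const promptTokens = 10000.0

	// A cold node must prefill the whole prompt before emitting a single token.
	coldTTFT := promptTokens / prefillTokensPerSec
	fmt.Printf("cold-node prefill adds ~%.1fs before the first token\n", coldTTFT)
}
```

A warm node that already caches the prefix skips this work entirely, which is the latency KV-Aware Routing recovers.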
The Solution: Global Radix Tree
Bauxite Enterprise maintains a real-time map of prefix hashes across your entire inference fleet. The router sends each request to the node holding the deepest matching prefix; when no node matches, it falls back to a standard strategy.
Zero-Trust Compatibility: Volatile-Only Caching
While standard KV-caching solutions often persist data to local SSDs or external databases, Bauxite enforces a Volatile-Only policy to maintain the “20MB Straitjacket” security model.
RAM-Only Radix Tree: Prefix maps are stored strictly in memory. If the memory ceiling is reached, Bauxite evicts the oldest hashes rather than spilling to disk.
Scrubbed Cache: The PII Janitor runs before the KV-router. Only redacted prompt signatures are ever cached, ensuring sensitive data like API keys or PII never reside in the KV-blocks.
Disk-Lockout: Even when integrated with external caching layers, Bauxite’s internal logic is structurally incapable of calling os.WriteFile.
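The first two guarantees compose: redaction runs before hashing, and the prefix map evicts rather than spills when it hits its ceiling. A minimal sketch, assuming a toy redaction regex and a simple insertion-order eviction policy (neither is Bauxite's actual implementation):

```go
package main

import (
	"container/list"
	"crypto/sha256"
	"fmt"
	"regexp"
)

// scrub stands in for the PII Janitor. The regex is a toy placeholder for
// illustration only; the point is that it runs BEFORE hashing, so only
// redacted signatures ever enter the map.
var keyPattern = regexp.MustCompile(`sk-[A-Za-z0-9]+`)

func scrub(prompt string) string {
	return keyPattern.ReplaceAllString(prompt, "[REDACTED]")
}

// prefixMap is a bounded, RAM-only index: at the ceiling it evicts the
// oldest entry instead of spilling to disk.
type prefixMap struct {
	limit   int
	order   *list.List // insertion order, oldest at front
	entries map[[32]byte]*list.Element
}

func newPrefixMap(limit int) *prefixMap {
	return &prefixMap{limit: limit, order: list.New(), entries: map[[32]byte]*list.Element{}}
}

func (m *prefixMap) add(prompt string) {
	sig := sha256.Sum256([]byte(scrub(prompt))) // scrub first, then hash
	if _, ok := m.entries[sig]; ok {
		return
	}
	if m.order.Len() >= m.limit { // ceiling reached: evict oldest, never touch disk
		oldest := m.order.Front()
		delete(m.entries, oldest.Value.([32]byte))
		m.order.Remove(oldest)
	}
	m.entries[sig] = m.order.PushBack(sig)
}

func main() {
	m := newPrefixMap(2)
	m.add("first prompt")
	m.add("my key is sk-abc123") // stored only as a redacted signature
	m.add("third prompt")        // evicts "first prompt"
	fmt.Println(m.order.Len())   // stays at the ceiling: 2
}
```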
Configuration
Enable KV-Aware routing by defining your vLLM cluster in the enterprise config block:
```yaml
enterprise:
  routing:
    strategy: kv_aware
    radix_depth: 3 # Number of blocks to track
    fallback: least_busy
  clusters:
    - name: vllm-prod
      endpoints: ["http://node-a:8000", "http://node-b:8000"]
```

LMCache Integration
Bauxite Enterprise works natively with LMCache. When a request arrives, Bauxite can pre-emptively signal LMCache to swap the required KV-blocks from CPU/SSD to GPU memory before the HTTP request even hits the inference engine.
Credits
Bauxite’s routing engine leverages prefix-matching logic from the LMCache project. We extend these patterns to work within a memory-safe Go runtime optimized for ephemeral, zero-disk environments.