How does DHARA achieve sub-100ms retrieval?

DHARA is a production retrieval-augmented generation (RAG) system that answers queries over more than 1 million documents while keeping p95 retrieval latency under 100 ms. Its architecture combines query normalization, vector retrieval, cache-aware request routing, and reranking before grounded response generation. Rather than optimizing one stage in isolation, DHARA tunes caching and reranking together so that fast responses do not come at the cost of factual quality. Under multi-tenant load, backpressure-aware serving keeps the system stable during concurrency spikes while preserving response consistency. The result is a high-throughput retrieval path whose latency stays predictable at production load, with answer quality maintained by selecting relevant context before generation. The project demonstrates practical RAG engineering in which throughput, ranking quality, and reliability are treated as coupled system constraints.
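To make the interplay of these stages concrete, here is a minimal single-process sketch in Python. It is not DHARA's implementation: every name in it (normalize, embed, LRUCache, Retriever, OverloadedError) is hypothetical, the hashed bag-of-words embedding and token-overlap reranker are toy stand-ins for a real encoder and reranking model, and the stage ordering (cache lookup first, then admission control, vector search, and reranking) is an assumption. It illustrates one way the ideas could fit together: the cache stores post-rerank results so a cache hit returns the same ranking as the slow path, and a bounded semaphore sheds load instead of queueing when too many retrievals are in flight.

```python
import math
import re
import threading
import zlib
from collections import OrderedDict


class OverloadedError(RuntimeError):
    """Raised when the retriever sheds load instead of queueing (hypothetical)."""


def normalize(query):
    """Lowercase, drop punctuation, collapse whitespace so equivalent queries share a cache key."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())


def embed(text, dim=64):
    """Toy hashed bag-of-words vector, L2-normalized (stand-in for a real encoder)."""
    vec = [0.0] * dim
    for tok in text.split():
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class LRUCache:
    """Small least-recently-used cache built on OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


class Retriever:
    def __init__(self, docs, cache_size=1024, max_in_flight=8):
        self.docs = docs
        self.index = [(doc_id, embed(normalize(text))) for doc_id, text in docs.items()]
        self.cache = LRUCache(cache_size)
        self.lock = threading.Lock()  # guards the cache
        self.slots = threading.BoundedSemaphore(max_in_flight)  # admission control

    def _vector_search(self, qvec, k):
        # Brute-force cosine similarity (vectors are unit length, so dot product suffices).
        scored = [(sum(a * b for a, b in zip(qvec, dvec)), doc_id)
                  for doc_id, dvec in self.index]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]

    def _rerank(self, query, candidates, top_n):
        # Toy reranker: Jaccard overlap between query tokens and document tokens.
        q_tokens = set(query.split())

        def overlap(doc_id):
            d_tokens = set(normalize(self.docs[doc_id]).split())
            return len(q_tokens & d_tokens) / (len(q_tokens | d_tokens) or 1)

        return sorted(candidates, key=overlap, reverse=True)[:top_n]

    def retrieve(self, raw_query, k=10, top_n=3):
        query = normalize(raw_query)
        key = (query, k, top_n)
        with self.lock:
            hit = self.cache.get(key)
        if hit is not None:
            return hit  # fast path: cached, already-reranked result
        if not self.slots.acquire(blocking=False):
            raise OverloadedError("too many retrievals in flight")  # shed load
        try:
            candidates = self._vector_search(embed(query), k)
            result = self._rerank(query, candidates, top_n)
        finally:
            self.slots.release()
        with self.lock:
            self.cache.put(key, result)
        return result
```

In this sketch, a caller would build a Retriever from a mapping of document IDs to text and call retrieve() per request, treating OverloadedError as a signal to retry later or fall back. Rejecting work immediately rather than blocking is one simple form of backpressure; it keeps tail latency bounded for admitted requests, at the cost of pushing retries onto the caller.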

Source: DHARA repository