# PromptCache Documentation
A smart semantic cache for high-scale GenAI workloads.
## What is PromptCache?

PromptCache is a lightweight middleware that sits between your application and your LLM provider. It uses semantic understanding to detect when a new prompt has the same intent as a previous one, and returns the cached result instantly.
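Because PromptCache exposes an OpenAI-compatible API, an existing client only needs its base URL repointed at the cache. A minimal sketch using only the Python standard library; the host, port, and model name here are illustrative assumptions, not PromptCache defaults:

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    # Build an OpenAI-style chat completion request aimed at PromptCache.
    # The model name and base URL are placeholders; adjust to your deployment.
    body = json.dumps({
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8080", "What is semantic caching?")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) is left out so the sketch stays self-contained; any OpenAI SDK pointed at the same base URL would work equally well.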
## Key Benefits
- Reduce Costs: Save up to 80% on LLM API costs
- Improve Latency: ~300ms vs ~1.5s average response time
- Better Scale: Cache hits are served locally, so they don't count against provider API rate limits
- Smart Matching: Semantic understanding prevents incorrect cache hits
## What's New in v0.4.0

- API Authentication: Bearer-token middleware gating all management endpoints (`API_AUTH_TOKEN`)
- Streaming (SSE): Full `stream: true` support across OpenAI, Mistral, and Claude, including streamed cache hits
- Runtime Config API: `GET`/`PATCH` `/v1/config` for live threshold and gray-zone updates
- Cache Warming: `POST /v1/cache/warm` to bulk pre-populate the cache from historical prompt/response pairs
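As a sketch of the new authenticated management endpoints, here is a runtime config update built with the standard library. Only the `PATCH /v1/config` endpoint and bearer-token auth come from the release notes above; the JSON field name `similarity_threshold` is a hypothetical example of what a live threshold update might look like:

```python
import json
import urllib.request

def build_config_patch(base_url: str, token: str, updates: dict) -> urllib.request.Request:
    # PATCH /v1/config with a Bearer token, per the v0.4.0 release notes.
    # The keys inside `updates` are assumptions for illustration.
    return urllib.request.Request(
        f"{base_url}/v1/config",
        data=json.dumps(updates).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

req = build_config_patch("http://localhost:8080", "my-secret-token",
                         {"similarity_threshold": 0.75})
print(req.get_method())  # PATCH
```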
## Previously in v0.3.0

- Prometheus metrics, structured logging, health checks
- Cache management API, LRU eviction
- ANN index for 5x faster similarity search
- Graceful shutdown, retry logic
## Quick Links
## Architecture
PromptCache uses a three-stage verification strategy:
- High similarity (≥70%) → direct cache hit
- Low similarity (<30%) → skip the cache entirely
- Gray zone (30-70%) → LLM verification for accuracy
This ensures cached responses are semantically correct, not just "close enough".
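The three-stage strategy above reduces to a simple branch on the similarity score. A sketch using the documented default thresholds (the function and return-value names are illustrative, not PromptCache's internal API):

```python
def cache_decision(similarity: float,
                   hit_threshold: float = 0.70,
                   skip_threshold: float = 0.30) -> str:
    # "hit": similarity is high enough to serve the cached response directly.
    # "miss": similarity is too low; bypass the cache and call the LLM.
    # "verify": gray zone; ask the LLM to confirm the cached answer still
    # matches the new prompt's intent before serving it.
    if similarity >= hit_threshold:
        return "hit"
    if similarity < skip_threshold:
        return "miss"
    return "verify"

print(cache_decision(0.85), cache_decision(0.10), cache_decision(0.50))
# hit miss verify
```

Both thresholds are tunable at runtime via the `/v1/config` endpoint, so the boundaries here are defaults rather than fixed constants.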
## Supported Providers

- OpenAI: `text-embedding-3-small` + `gpt-4o-mini`
- Mistral AI: `mistral-embed` + `mistral-small-latest`
- Claude (Anthropic): `voyage-3` + `claude-3-haiku`
## Features

- ✅ Multiple provider support (OpenAI, Mistral, Claude)
- ✅ Dynamic provider switching via API
- ✅ Configurable similarity thresholds (env vars + runtime PATCH)
- ✅ Gray zone verification control
- ✅ OpenAI-compatible API (including SSE streaming)
- ✅ Bearer-token authentication for management endpoints
- ✅ Cache warming from historical data
- ✅ Docker support
- ✅ Thread-safe operations
- ✅ BadgerDB persistence
- ✅ Prometheus metrics export
- ✅ Health check endpoints
- ✅ Cache management API
- ✅ Structured JSON logging
- ✅ LRU cache eviction
- ✅ Request tracing (X-Request-ID)
## Community
## License

MIT License - see LICENSE for details.