PromptCache Documentation
A smart semantic cache for high-scale GenAI workloads.
What is PromptCache?
PromptCache is a lightweight middleware that sits between your application and your LLM provider. It uses semantic understanding to detect when a new prompt has the same intent as a previous one, and returns the cached result instantly.
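For illustration, because PromptCache exposes an OpenAI-compatible API (see Features below), an application can usually keep its existing OpenAI client and simply point it at PromptCache. The base URL, port, and path in this sketch are assumptions, not the project's actual defaults.

```python
# Minimal sketch: reuse the standard OpenAI Python SDK, pointed at PromptCache.
# The base_url below is a placeholder -- substitute your deployment's address.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical PromptCache address
    api_key="sk-...",                     # forwarded to the upstream provider
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

A later prompt with the same intent (for example, "Tell me France's capital city") can then be served from the cache instead of triggering another provider call.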
Key Benefits
- Reduce Costs: Save up to 80% on LLM API costs
- Improve Latency: ~300 ms cached responses vs ~1.5 s average provider round trips
- Better Scale: Cache hits are served locally and do not count against provider rate limits
- Smart Matching: Semantic understanding prevents incorrect cache hits
What's New in v0.3.0
- Prometheus Metrics - Export hit rates, latency, and request counts
- Health Checks - Kubernetes-ready liveness/readiness probes
- Cache Management API - View stats, clear cache, delete entries
- Structured Logging - JSON logs for easy aggregation
- ANN Index - 5x faster similarity search
- Graceful Shutdown - Clean request draining
- Retry Logic - Automatic retries with exponential backoff (sketched below)
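The retry behaviour can be pictured with a short sketch. The function name, delay values, and exception handling below are illustrative assumptions, not PromptCache's actual implementation.

```python
# Illustrative retry loop with exponential backoff and jitter (not the project's real code).
import random
import time

def call_with_retries(request_fn, max_attempts=4, base_delay=0.5):
    """Retry a provider call, roughly doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```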
Architecture
PromptCache uses a three-stage verification strategy:
- High similarity (≥70%) → Direct cache hit
- Low similarity (<30%) → Skip the cache and call the provider directly
- Gray zone (30-70%) → LLM verification for accuracy
This ensures cached responses are semantically correct, not just "close enough".
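As a rough sketch, the routing decision might look like the following. The 0.70 and 0.30 thresholds mirror the percentages above; the function and its callbacks are hypothetical stand-ins, not the project's actual internals.

```python
# Sketch of the three-stage decision. The cache lookup and verification
# callbacks are hypothetical stand-ins for PromptCache's internals.
HIT_THRESHOLD = 0.70   # >= 70% similarity: trust the cached answer
MISS_THRESHOLD = 0.30  # < 30% similarity: skip the cache entirely

def resolve(similarity, cached_response, call_provider, verify_same_intent):
    """Decide whether to serve the cached response or call the upstream LLM."""
    if cached_response is not None and similarity >= HIT_THRESHOLD:
        return cached_response            # direct cache hit
    if cached_response is None or similarity < MISS_THRESHOLD:
        return call_provider()            # skip the cache
    # Gray zone: ask the LLM whether the two prompts really share the same intent.
    return cached_response if verify_same_intent() else call_provider()

# Example: 0.55 similarity falls in the gray zone, so verification decides.
print(resolve(0.55, "Paris", lambda: "fresh provider answer", lambda: True))  # -> Paris
```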
Supported Providers
- OpenAI: text-embedding-3-small + gpt-4o-mini
- Mistral AI: mistral-embed + mistral-small-latest
- Claude (Anthropic): voyage-3 + claude-3-haiku
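For quick reference, the embedding/completion pairings above can be summarized in a small mapping. The structure below is purely illustrative and is not PromptCache's configuration format.

```python
# Illustrative mapping of each provider to its embedding and completion models.
PROVIDERS = {
    "openai":  {"embedding": "text-embedding-3-small", "completion": "gpt-4o-mini"},
    "mistral": {"embedding": "mistral-embed",          "completion": "mistral-small-latest"},
    "claude":  {"embedding": "voyage-3",               "completion": "claude-3-haiku"},
}
```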
Features
- ✅ Multiple provider support (OpenAI, Mistral, Claude)
- ✅ Dynamic provider switching via API
- ✅ Configurable similarity thresholds
- ✅ Gray zone verification control
- ✅ OpenAI-compatible API
- ✅ Docker support
- ✅ Thread-safe operations
- ✅ BadgerDB persistence
- ✅ Prometheus metrics export
- ✅ Health check endpoints
- ✅ Cache management API
- ✅ Structured JSON logging
- ✅ LRU cache eviction
- ✅ Request tracing (X-Request-ID)
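To make the operational features concrete, here is a hedged sketch of probing them over HTTP. The /metrics and /healthz paths, the port, and the chat completions route are assumptions; check your deployment's configuration for the real values.

```python
# Hypothetical probes of the operational endpoints (paths and port are assumptions).
import uuid
import requests

BASE = "http://localhost:8080"

# Prometheus metrics exposition: hit rates, latency, request counts.
print(requests.get(f"{BASE}/metrics").text[:300])

# Liveness/readiness probe, as a Kubernetes kubelet would call it.
print(requests.get(f"{BASE}/healthz").status_code)

# Request tracing: attach an X-Request-ID so the call can be followed in the JSON logs.
requests.post(
    f"{BASE}/v1/chat/completions",
    headers={"X-Request-ID": str(uuid.uuid4())},
    json={"model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "ping"}]},
)
```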
License
MIT License - see LICENSE for details.