PromptCache Documentation
A smart semantic cache for high-scale GenAI workloads.
What is PromptCache?
PromptCache is a lightweight middleware that sits between your application and your LLM provider. It uses semantic understanding to detect when a new prompt has the same intent as a previous one, and returns the cached result instantly.
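For illustration, because PromptCache exposes an OpenAI-compatible API (see Features below), an application can usually keep its existing OpenAI client and simply point it at PromptCache. The base URL, port, and path in this sketch are assumptions, not the project's actual defaults.

```python
# Minimal sketch: reuse the standard OpenAI Python SDK, pointed at PromptCache.
# The base_url below is a placeholder -- substitute your deployment's address.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical PromptCache address
    api_key="sk-...",                     # forwarded to the upstream provider
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

A later prompt with the same intent (for example, "Tell me France's capital city") can then be served from the cache instead of triggering another provider call.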
Key Benefits
- Reduce Costs: Save up to 80% on LLM API costs
- Improve Latency: ~300 ms cached responses vs ~1.5 s average provider round trips
- Better Scale: Cache hits are served locally and do not count against provider rate limits
- Smart Matching: Semantic understanding prevents incorrect cache hits
What's New in v0.3.0
- Prometheus Metrics - Export hit rates, latency, and request counts
- Health Checks - Kubernetes-ready liveness/readiness probes
- Cache Management API - View stats, clear cache, delete entries
- Structured Logging - JSON logs for easy aggregation
- ANN Index - 5x faster similarity search
- Graceful Shutdown - Clean request draining
- Retry Logic - Automatic retries with exponential backoff (sketched below)
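The retry behaviour can be pictured with a short sketch. The function name, delay values, and exception handling below are illustrative assumptions, not PromptCache's actual implementation.

```python
# Illustrative retry loop with exponential backoff and jitter (not the project's real code).
import random
import time

def call_with_retries(request_fn, max_attempts=4, base_delay=0.5):
    """Retry a provider call, roughly doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```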
Architecture
PromptCache uses a three-stage verification strategy:
- High similarity (≥70%) → Direct cache hit
- Low similarity (<30%) → Skip the cache and call the provider directly
- Gray zone (30-70%) → LLM verification for accuracy
This ensures cached responses are semantically correct, not just "close enough".
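As a rough sketch, the routing decision might look like the following. The 0.70 and 0.30 thresholds mirror the percentages above; the function and its callbacks are hypothetical stand-ins, not the project's actual internals.

```python
# Sketch of the three-stage decision. The cache lookup and verification
# callbacks are hypothetical stand-ins for PromptCache's internals.
HIT_THRESHOLD = 0.70   # >= 70% similarity: trust the cached answer
MISS_THRESHOLD = 0.30  # < 30% similarity: skip the cache entirely

def resolve(similarity, cached_response, call_provider, verify_same_intent):
    """Decide whether to serve the cached response or call the upstream LLM."""
    if cached_response is not None and similarity >= HIT_THRESHOLD:
        return cached_response            # direct cache hit
    if cached_response is None or similarity < MISS_THRESHOLD:
        return call_provider()            # skip the cache
    # Gray zone: ask the LLM whether the two prompts really share the same intent.
    return cached_response if verify_same_intent() else call_provider()

# Example: 0.55 similarity falls in the gray zone, so verification decides.
print(resolve(0.55, "Paris", lambda: "fresh provider answer", lambda: True))  # -> Paris
```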
Supported Providers
- OpenAI: text-embedding-3-small + gpt-4o-mini
- Mistral AI: mistral-embed + mistral-small-latest
- Claude (Anthropic): voyage-3 + claude-3-haiku
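For quick reference, the embedding/completion pairings above can be summarized in a small mapping. The structure below is purely illustrative and is not PromptCache's configuration format.

```python
# Illustrative mapping of each provider to its embedding and completion models.
PROVIDERS = {
    "openai":  {"embedding": "text-embedding-3-small", "completion": "gpt-4o-mini"},
    "mistral": {"embedding": "mistral-embed",          "completion": "mistral-small-latest"},
    "claude":  {"embedding": "voyage-3",               "completion": "claude-3-haiku"},
}
```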
Features
- ✅ Multiple provider support (OpenAI, Mistral, Claude)
- ✅ Dynamic provider switching via API
- ✅ Configurable similarity thresholds
- ✅ Gray zone verification control
- ✅ OpenAI-compatible API
- ✅ Docker support
- ✅ Thread-safe operations
- ✅ BadgerDB persistence
- ✅ Prometheus metrics export
- ✅ Health check endpoints
- ✅ Cache management API
- ✅ Structured JSON logging
- ✅ LRU cache eviction
- ✅ Request tracing (X-Request-ID)
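To make the operational features concrete, here is a hedged sketch of probing them over HTTP. The /metrics and /healthz paths, the port, and the chat completions route are assumptions; check your deployment's configuration for the real values.

```python
# Hypothetical probes of the operational endpoints (paths and port are assumptions).
import uuid
import requests

BASE = "http://localhost:8080"

# Prometheus metrics exposition: hit rates, latency, request counts.
print(requests.get(f"{BASE}/metrics").text[:300])

# Liveness/readiness probe, as a Kubernetes kubelet would call it.
print(requests.get(f"{BASE}/healthz").status_code)

# Request tracing: attach an X-Request-ID so the call can be followed in the JSON logs.
requests.post(
    f"{BASE}/v1/chat/completions",
    headers={"X-Request-ID": str(uuid.uuid4())},
    json={"model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "ping"}]},
)
```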
License
MIT License - see LICENSE for details.