Throughput Manager
The Throughput Manager is a core infrastructure component designed to maximize reliability and throughput by intelligently distributing requests across multiple LLM models and handling rate limit failures gracefully.
Aim & Purpose
Key Goals
- Maximize Throughput: Distribute load across multiple models to achieve higher overall request capacity
- Ensure Reliability: Automatic failover when individual models hit rate limits
- Zero Downtime: Seamless request handling even during provider throttling
- Simple Configuration: Enable via environment variables with minimal setup
How It Works
Multi-Model Load Balancing
The Throughput Manager maintains a pool of configured models and cycles through them using a round-robin approach. This distributes requests evenly across all available models, preventing any single model from becoming a bottleneck.
```
Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A (cycles back)
...
```
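This round-robin behavior can be sketched in a few lines of Go. Everything here is illustrative: `ModelPool`, its fields, and `Next` are assumed names for explanation, not ObjectWeaver's actual internals.

```go
package throughput

import "sync"

// ModelPool is a hypothetical pool of configured model identifiers.
// The mutex keeps selection safe under concurrent requests.
type ModelPool struct {
	mu     sync.Mutex
	models []string
	next   int
}

// Next returns the next model in round-robin order, wrapping back to
// the first model after the last one, as in the diagram above.
func (p *ModelPool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	model := p.models[p.next]
	p.next = (p.next + 1) % len(p.models)
	return model
}
```

On a three-model pool, four successive `Next` calls return models A, B, C, and then A again, matching the cycle shown above.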
Automatic Failover & Cooldown
Rate Limit Detection
The manager automatically detects HTTP 429 (Too Many Requests) responses from providers and marks the affected model as temporarily unavailable.
1-Minute Cooldown
Rate-limited models enter a 1-minute cooldown period. During this time, requests are automatically routed to other available models.
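One plausible way to implement the cooldown (again a sketch, with `CooldownTracker` and its methods as assumed names): store an expiry timestamp per model when a 429 arrives, and treat the model as unavailable until that timestamp passes.

```go
package throughput

import (
	"sync"
	"time"
)

// cooldownPeriod mirrors the documented 1-minute cooldown.
const cooldownPeriod = time.Minute

// CooldownTracker is a hypothetical record of when each rate-limited
// model becomes usable again.
type CooldownTracker struct {
	mu    sync.Mutex
	until map[string]time.Time
}

func NewCooldownTracker() *CooldownTracker {
	return &CooldownTracker{until: make(map[string]time.Time)}
}

// MarkRateLimited starts a model's cooldown after an HTTP 429.
func (c *CooldownTracker) MarkRateLimited(model string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.until[model] = time.Now().Add(cooldownPeriod)
}

// Available reports whether a model is outside its cooldown window.
// Models that were never rate-limited have a zero timestamp here and
// are always available.
func (c *CooldownTracker) Available(model string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Now().After(c.until[model])
}
```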
Failover Flow
1. Request sent to Model A
2. Model A returns HTTP 429 (rate limited)
3. Throughput Manager marks Model A as "cooling down"
4. Request automatically retried with Model B
5. After 1 minute, Model A becomes available again
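Combining the two sketches above, a failover loop that mirrors these five steps could look roughly like the following. `Dispatch`, the `send` callback, and `errRateLimited` are illustrative assumptions, not the actual implementation.

```go
package throughput

import "errors"

// errRateLimited is an assumed sentinel for an HTTP 429 response.
var errRateLimited = errors.New("HTTP 429: too many requests")

// Dispatch tries each model in rotation, skipping models that are
// cooling down and failing over when a model returns a rate limit.
func Dispatch(pool *ModelPool, cooldowns *CooldownTracker, send func(model string) error) error {
	for i := 0; i < len(pool.models); i++ {
		model := pool.Next()
		if !cooldowns.Available(model) {
			continue // still cooling down; try the next model
		}
		err := send(model)
		if errors.Is(err, errRateLimited) {
			cooldowns.MarkRateLimited(model) // step 3: 1-minute cooldown starts
			continue                         // step 4: retry with the next model
		}
		return err // success, or a non-rate-limit error to surface
	}
	return errors.New("all configured models are cooling down")
}
```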
Configuration
Enable the Throughput Manager using environment variables:
Environment Variables
| Variable | Description | Example |
|---|---|---|
| `THROUGHPUT_MANAGER` | Enable or disable the throughput manager | `true` |
| `THROUGHPUT_MODELS` | Comma-separated list of model identifiers to use | `gpt-4o,gpt-4o-mini,gpt-4-turbo` |
Example Configuration
```bash
# Enable throughput management
export THROUGHPUT_MANAGER=true

# Configure multiple models for load balancing
export THROUGHPUT_MODELS=gpt-4o,gpt-4o-mini,gpt-4-turbo
```
Docker Configuration
```yaml
services:
  inferno:
    environment:
      - THROUGHPUT_MANAGER=true
      - THROUGHPUT_MODELS=gpt-4o,gpt-4o-mini,gpt-4-turbo
```
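For illustration, a service reading these variables at startup might do something like the following; the `Config` type and parsing details are assumptions, not ObjectWeaver's actual startup code.

```go
package throughput

import (
	"os"
	"strings"
)

// Config holds the throughput-manager settings read from the environment.
type Config struct {
	Enabled bool
	Models  []string
}

// LoadConfig reads THROUGHPUT_MANAGER and THROUGHPUT_MODELS, splitting
// the comma-separated model list into individual identifiers.
func LoadConfig() Config {
	cfg := Config{Enabled: os.Getenv("THROUGHPUT_MANAGER") == "true"}
	for _, m := range strings.Split(os.Getenv("THROUGHPUT_MODELS"), ",") {
		if m = strings.TrimSpace(m); m != "" {
			cfg.Models = append(cfg.Models, m)
		}
	}
	return cfg
}
```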
Best Practices
Recommended Model Selection
- Mix model tiers: Combine high-capacity and standard models for balanced cost and performance
- Use similar capabilities: Ensure all models in the pool can handle your use case
- Consider rate limits: Different models may have different rate limit thresholds
- Monitor usage: Track which models are being rate-limited most frequently (a counter sketch follows below)
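As a starting point for that monitoring, a per-model counter of 429 responses (purely a sketch; ObjectWeaver may expose its own metrics) could be as simple as:

```go
package throughput

import "sync"

// RateLimitStats counts HTTP 429 responses per model so the most
// frequently throttled models are easy to spot.
type RateLimitStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewRateLimitStats() *RateLimitStats {
	return &RateLimitStats{counts: make(map[string]int)}
}

// Record increments the 429 counter for a model.
func (s *RateLimitStats) Record(model string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[model]++
}

// Snapshot returns a copy of the counts for logging or dashboards.
func (s *RateLimitStats) Snapshot() map[string]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int, len(s.counts))
	for m, n := range s.counts {
		out[m] = n
	}
	return out
}
```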
Benefits
| Benefit | Description |
|---|---|
| Higher Throughput | Aggregate rate limits across multiple models |
| Improved Reliability | No single point of failure |
| Automatic Recovery | Self-healing without manual intervention |
| Cost Optimization | Distribute load to prevent expensive retry storms |
| Zero Configuration Changes | Works transparently with existing request patterns |
When to Use
The Throughput Manager is particularly valuable for:
- High-volume applications: Processing many requests that might hit rate limits
- Production systems: Where reliability and uptime are critical
- Burst traffic: Applications with unpredictable traffic spikes
- Multi-tenant systems: Serving multiple users with shared rate limits
The Throughput Manager works seamlessly with other ObjectWeaver features such as the Batch & Priority System and Streaming Requests.