Throughput Manager

The Throughput Manager is a core infrastructure component designed to maximize reliability and throughput by intelligently distributing requests across multiple LLM models and handling rate limit failures gracefully.

Aim & Purpose

Key Goals

  • Maximize Throughput: Distribute load across multiple models to achieve higher overall request capacity
  • Ensure Reliability: Automatic failover when individual models hit rate limits
  • Zero Downtime: Seamless request handling even during provider throttling
  • Simple Configuration: Enable via environment variables with minimal setup

How It Works

Multi-Model Load Balancing

The Throughput Manager maintains a pool of configured models and cycles through them using a round-robin approach. This distributes requests evenly across all available models, preventing any single model from becoming a bottleneck.

Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A (cycles back)
...
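
In code, the rotation is little more than a cursor over the configured model list. The Python sketch below is illustrative only (the class and method names are invented for this example, not ObjectWeaver's internal API):

```python
import itertools

class RoundRobinPool:
    """Cycle through a fixed pool of model identifiers."""

    def __init__(self, models):
        self._models = list(models)
        self._cycle = itertools.cycle(self._models)

    def next_model(self):
        return next(self._cycle)

pool = RoundRobinPool(["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"])
for i in range(4):
    print(f"Request {i + 1} -> {pool.next_model()}")
# Request 1 -> gpt-4o
# Request 2 -> gpt-4o-mini
# Request 3 -> gpt-4-turbo
# Request 4 -> gpt-4o   (cycles back)
```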

Automatic Failover & Cooldown

Rate Limit Detection

Automatically detects HTTP 429 (Too Many Requests) errors from providers and marks the affected model as temporarily unavailable.

1-Minute Cooldown

Rate-limited models enter a 1-minute cooldown period. During this time, requests are automatically routed to other available models.

Failover Flow

1. Request sent to Model A
2. Model A returns HTTP 429 (rate limited)
3. Throughput Manager marks Model A as "cooling down"
4. Request automatically retried with Model B
5. After 1 minute, Model A becomes available again
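
A minimal sketch of this flow, assuming a 60-second cooldown and a provider client that exposes an HTTP status code (all names here are hypothetical, not the actual implementation):

```python
import time

COOLDOWN_SECONDS = 60  # the 1-minute cooldown described above

class FailoverPool:
    """Round-robin pool that skips models cooling down after a 429."""

    def __init__(self, models):
        self._models = list(models)
        self._index = 0
        self._cooling_until = {}  # model -> timestamp when it is usable again

    def next_model(self):
        now = time.monotonic()
        for _ in range(len(self._models)):
            model = self._models[self._index]
            self._index = (self._index + 1) % len(self._models)
            if self._cooling_until.get(model, 0.0) <= now:
                return model
        raise RuntimeError("all models are cooling down")

    def mark_rate_limited(self, model):
        # Called when a provider responds with HTTP 429.
        self._cooling_until[model] = time.monotonic() + COOLDOWN_SECONDS

def send_with_failover(pool, send):
    """Retry across the pool until some model accepts the request."""
    while True:
        model = pool.next_model()
        response = send(model)           # hypothetical provider call
        if response.status_code == 429:  # rate limited: cool down, try next
            pool.mark_rate_limited(model)
            continue
        return response
```

The key design point is that cooldown state is just a per-model timestamp: once the clock passes it, the model re-enters rotation automatically, with no explicit reset step.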

Configuration

Enable the Throughput Manager using environment variables:

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| `THROUGHPUT_MANAGER` | Enable/disable the throughput manager | `true` |
| `THROUGHPUT_MODELS` | Comma-separated list of model identifiers to use | `gpt-4o,gpt-4o-mini,gpt-4-turbo` |

Example Configuration

```bash
# Enable throughput management
export THROUGHPUT_MANAGER=true

# Configure multiple models for load balancing
export THROUGHPUT_MODELS=gpt-4o,gpt-4o-mini,gpt-4-turbo
```
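
To sanity-check the configuration from your own code, something like the following reads the same variables; exactly how the service parses the flag is an assumption here:

```python
import os

# Mirror the service's configuration (parsing shown is illustrative).
enabled = os.environ.get("THROUGHPUT_MANAGER", "false").lower() == "true"
models = [m.strip() for m in os.environ.get("THROUGHPUT_MODELS", "").split(",") if m.strip()]

if enabled and models:
    print(f"Throughput Manager enabled with {len(models)} models: {models}")
```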

Docker Configuration

```yaml
services:
  inferno:
    environment:
      - THROUGHPUT_MANAGER=true
      - THROUGHPUT_MODELS=gpt-4o,gpt-4o-mini,gpt-4-turbo
```

Best Practices

  1. Mix model tiers: Combine high-capacity and standard models for balanced cost and performance
  2. Use similar capabilities: Ensure all models in the pool can handle your use case
  3. Consider rate limits: Different models may have different rate limit thresholds
  4. Monitor usage: Track which models are being rate-limited most frequently (see the counting sketch after this list)
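
As a starting point for the monitoring suggestion above, a simple per-model tally of 429 responses is often enough to spot which models throttle most; `record_response` is a hypothetical hook, not part of ObjectWeaver:

```python
from collections import Counter

rate_limit_hits = Counter()

def record_response(model, status_code):
    """Tally rate-limit responses per model for later inspection."""
    if status_code == 429:
        rate_limit_hits[model] += 1

# After some traffic, the most-throttled models surface first, e.g.:
# rate_limit_hits.most_common() -> [('gpt-4o', 12), ('gpt-4-turbo', 3)]
```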

Benefits

| Benefit | Description |
| --- | --- |
| Higher Throughput | Aggregate rate limits across multiple models |
| Improved Reliability | No single point of failure |
| Automatic Recovery | Self-healing without manual intervention |
| Cost Optimization | Distribute load to prevent expensive retry storms |
| Zero Configuration Changes | Works transparently with existing request patterns |

When to Use

The Throughput Manager is particularly valuable for:

  • High-volume applications: Processing many requests that might hit rate limits
  • Production systems: Where reliability and uptime are critical
  • Burst traffic: Applications with unpredictable traffic spikes
  • Multi-tenant systems: Serving multiple users with shared rate limits
> **Tip:** The Throughput Manager works seamlessly with other ObjectWeaver features like the Batch & Priority System and Streaming Requests.