Throughput Manager
The Throughput Manager is a core infrastructure component designed to maximize reliability and throughput by intelligently distributing requests across multiple LLM models and handling rate limit failures gracefully.
Aim & Purpose
Key Goals
- Maximize Throughput: Distribute load across multiple models to achieve higher overall request capacity
- Ensure Reliability: Automatic failover when individual models hit rate limits
- Zero Downtime: Seamless request handling even during provider throttling
- Simple Configuration: Enable via environment variables with minimal setup
How It Works
Multi-Model Load Balancing
The Throughput Manager maintains a pool of configured models and cycles through them using a round-robin approach. This distributes requests evenly across all available models, preventing any single model from becoming a bottleneck.
```
Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A (cycles back)
...
```
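This round-robin behavior can be sketched in a few lines of Go. Everything here is illustrative: `ModelPool`, its fields, and `Next` are assumed names for explanation, not ObjectWeaver's actual internals.

```go
package throughput

import "sync"

// ModelPool is a hypothetical pool of configured model identifiers.
// The mutex keeps selection safe under concurrent requests.
type ModelPool struct {
	mu     sync.Mutex
	models []string
	next   int
}

// Next returns the next model in round-robin order, wrapping back to
// the first model after the last one, as in the diagram above.
func (p *ModelPool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	model := p.models[p.next]
	p.next = (p.next + 1) % len(p.models)
	return model
}
```

On a three-model pool, four successive `Next` calls return models A, B, C, and then A again, matching the cycle shown above.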
Automatic Failover & Cooldown
Rate Limit Detection
The manager automatically detects HTTP 429 (Too Many Requests) responses from providers and marks the affected model as temporarily unavailable.
1-Minute Cooldown
Rate-limited models enter a 1-minute cooldown period. During this time, requests are automatically routed to other available models.
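One plausible way to implement the cooldown (again a sketch, with `CooldownTracker` and its methods as assumed names): store an expiry timestamp per model when a 429 arrives, and treat the model as unavailable until that timestamp passes.

```go
package throughput

import (
	"sync"
	"time"
)

// cooldownPeriod mirrors the documented 1-minute cooldown.
const cooldownPeriod = time.Minute

// CooldownTracker is a hypothetical record of when each rate-limited
// model becomes usable again.
type CooldownTracker struct {
	mu    sync.Mutex
	until map[string]time.Time
}

func NewCooldownTracker() *CooldownTracker {
	return &CooldownTracker{until: make(map[string]time.Time)}
}

// MarkRateLimited starts a model's cooldown after an HTTP 429.
func (c *CooldownTracker) MarkRateLimited(model string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.until[model] = time.Now().Add(cooldownPeriod)
}

// Available reports whether a model is outside its cooldown window.
// Models that were never rate-limited have a zero timestamp here and
// are always available.
func (c *CooldownTracker) Available(model string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Now().After(c.until[model])
}
```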
Failover Flow
1. Request sent to Model A
2. Model A returns HTTP 429 (rate limited)
3. Throughput Manager marks Model A as "cooling down"
4. Request automatically retried with Model B
5. After 1 minute, Model A becomes available again
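Combining the two sketches above, a failover loop that mirrors these five steps could look roughly like the following. `Dispatch`, the `send` callback, and `errRateLimited` are illustrative assumptions, not the actual implementation.

```go
package throughput

import "errors"

// errRateLimited is an assumed sentinel for an HTTP 429 response.
var errRateLimited = errors.New("HTTP 429: too many requests")

// Dispatch tries each model in rotation, skipping models that are
// cooling down and failing over when a model returns a rate limit.
func Dispatch(pool *ModelPool, cooldowns *CooldownTracker, send func(model string) error) error {
	for i := 0; i < len(pool.models); i++ {
		model := pool.Next()
		if !cooldowns.Available(model) {
			continue // still cooling down; try the next model
		}
		err := send(model)
		if errors.Is(err, errRateLimited) {
			cooldowns.MarkRateLimited(model) // step 3: 1-minute cooldown starts
			continue                         // step 4: retry with the next model
		}
		return err // success, or a non-rate-limit error to surface
	}
	return errors.New("all configured models are cooling down")
}
```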
Configuration
Enable the Throughput Manager using environment variables:
Environment Variables
| Variable | Description | Example |
|---|---|---|
| `THROUGHPUT_MANAGER` | Enable or disable the throughput manager | `true` |
| `THROUGHPUT_MODELS` | Comma-separated list of model identifiers to use | `gpt-4o,gpt-4o-mini,gpt-4-turbo` |
Example Configuration
```bash
# Enable throughput management
export THROUGHPUT_MANAGER=true

# Configure multiple models for load balancing
export THROUGHPUT_MODELS=gpt-4o,gpt-4o-mini,gpt-4-turbo
```
Docker Configuration
```yaml
services:
  inferno:
    environment:
      - THROUGHPUT_MANAGER=true
      - THROUGHPUT_MODELS=gpt-4o,gpt-4o-mini,gpt-4-turbo
```
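For illustration, a service reading these variables at startup might do something like the following; the `Config` type and parsing details are assumptions, not ObjectWeaver's actual startup code.

```go
package throughput

import (
	"os"
	"strings"
)

// Config holds the throughput-manager settings read from the environment.
type Config struct {
	Enabled bool
	Models  []string
}

// LoadConfig reads THROUGHPUT_MANAGER and THROUGHPUT_MODELS, splitting
// the comma-separated model list into individual identifiers.
func LoadConfig() Config {
	cfg := Config{Enabled: os.Getenv("THROUGHPUT_MANAGER") == "true"}
	for _, m := range strings.Split(os.Getenv("THROUGHPUT_MODELS"), ",") {
		if m = strings.TrimSpace(m); m != "" {
			cfg.Models = append(cfg.Models, m)
		}
	}
	return cfg
}
```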
Best Practices
Recommended Model Selection
- Mix model tiers: Combine high-capacity and standard models for balanced cost and performance
- Use similar capabilities: Ensure all models in the pool can handle your use case
- Consider rate limits: Different models may have different rate limit thresholds
- Monitor usage: Track which models are being rate-limited most frequently (a counter sketch follows below)
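As a starting point for that monitoring, a per-model counter of 429 responses (purely a sketch; ObjectWeaver may expose its own metrics) could be as simple as:

```go
package throughput

import "sync"

// RateLimitStats counts HTTP 429 responses per model so the most
// frequently throttled models are easy to spot.
type RateLimitStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewRateLimitStats() *RateLimitStats {
	return &RateLimitStats{counts: make(map[string]int)}
}

// Record increments the 429 counter for a model.
func (s *RateLimitStats) Record(model string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[model]++
}

// Snapshot returns a copy of the counts for logging or dashboards.
func (s *RateLimitStats) Snapshot() map[string]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int, len(s.counts))
	for m, n := range s.counts {
		out[m] = n
	}
	return out
}
```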
Benefits
| Benefit | Description |
|---|---|
| Higher Throughput | Aggregate rate limits across multiple models |
| Improved Reliability | No single point of failure |
| Automatic Recovery | Self-healing without manual intervention |
| Cost Optimization | Distribute load to prevent expensive retry storms |
| Zero Configuration Changes | Works transparently with existing request patterns |
When to Use
The Throughput Manager is particularly valuable for:
- High-volume applications: Processing many requests that might hit rate limits
- Production systems: Where reliability and uptime are critical
- Burst traffic: Applications with unpredictable traffic spikes
- Multi-tenant systems: Serving multiple users with shared rate limits
The Throughput Manager works seamlessly with other ObjectWeaver features such as the Batch & Priority System and Streaming Requests.