Token optimization in FlyAI API Gateway
LLM cost grows roughly linearly with token count. At 50–100M tokens per month, a 40% saving translates to thousands of dollars per month.
Technique 1: semantic cache
If two requests are semantically close, we serve the response from cache. We store an embedding of each request; cosine similarity above 0.92 with a cached entry triggers reuse.
Result: 25–35% of requests are served from cache on typical enterprise loads.
Technique 2: complexity routing
Simple requests (classification, extraction) → cheap models (Gemini Flash Lite, Haiku). Complex requests (reasoning, long context) → top-tier models (GPT-5.2, Claude Sonnet 4.5).
Result: up to 60% reduction in average token cost.
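The routing decision can be sketched as a simple classifier. The model names, keyword list, and token-estimate heuristic below are all illustrative assumptions, not the gateway's actual routing logic:

```python
CHEAP_MODEL = "cheap-model"       # placeholder for a Flash/Haiku-class model
PREMIUM_MODEL = "premium-model"   # placeholder for a top-tier reasoning model

# Hypothetical markers of a complex request (assumption for illustration).
COMPLEX_HINTS = ("explain why", "prove", "step by step", "analyze")

def route(request: str, max_cheap_tokens: int = 2000) -> str:
    # Long contexts and reasoning-style prompts go to the premium tier;
    # everything else is handled by the cheap tier.
    approx_tokens = len(request) // 4  # rough chars-per-token estimate
    if approx_tokens > max_cheap_tokens:
        return PREMIUM_MODEL
    if any(hint in request.lower() for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL
```

A production router might instead use a small classifier model, but the cost logic is the same: only pay top-tier prices when the request warrants it.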
Technique 3: context compression
Long documents are first summarized by a cheap model, and a compact digest with pinpoint citations of the relevant chunks is sent to the main model.
Result: 100k-token contexts shrink to 8–12k without quality loss.
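A sketch of the compression step, assuming a `summarize` callable that wraps the cheap model (stubbed here) and fixed-size character chunking; real chunking would follow document structure:

```python
from typing import Callable

def compress_context(document: str, summarize: Callable[[str], str],
                     chunk_size: int = 500) -> str:
    # Split the document into chunks, summarize each with a cheap model,
    # and tag every summary with its chunk id so the main model can cite
    # the exact source span.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    digest_lines = [f"[chunk {i}] {summarize(chunk)}"
                    for i, chunk in enumerate(chunks)]
    return "\n".join(digest_lines)
```

The digest replaces the raw document in the main model's prompt; only the cited chunks are fetched in full if the model asks for them.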
Technique 4: batching
Multiple independent requests are bundled into a single batch call, lowering per-request overhead.
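A minimal micro-batcher sketch. `send_batch` stands in for a provider's batch endpoint (an assumption here); the accumulate-then-flush pattern is the point:

```python
from typing import Callable

class MicroBatcher:
    """Accumulates independent requests and flushes them as one batch call."""

    def __init__(self, send_batch: Callable[[list[str]], list[str]],
                 batch_size: int = 8):
        self.send_batch = send_batch  # provider batch call (stubbed in tests)
        self.batch_size = batch_size
        self.pending: list[str] = []
        self.results: list[str] = []

    def submit(self, request: str) -> None:
        # Queue a request; flush automatically once the batch is full.
        self.pending.append(request)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # One API call covers every pending request.
        if self.pending:
            self.results.extend(self.send_batch(self.pending))
            self.pending = []
```

A production batcher would also flush on a timeout so a half-full batch never waits indefinitely.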
Real numbers
Across our customers' production loads, we measure 40–60% savings versus direct OpenAI/Anthropic integrations, with no quality compromises.