Token optimization in FlyAI API Gateway
LLM cost grows roughly linearly with token count. At 50–100M tokens per month, a 40% saving translates to thousands of dollars per month.
Technique 1: semantic cache
If two requests are semantically close, we serve the response from cache. We store an embedding of each request; cosine similarity above 0.92 with a cached entry triggers reuse.
Result: 25–35% of requests are served from cache on typical enterprise loads.
Technique 2: complexity routing
Simple requests (classification, extraction) → cheap models (Gemini Flash Lite, Haiku). Complex requests (reasoning, long context) → top-tier models (GPT-5.2, Claude Sonnet 4.5).
Result: up to 60% reduction in average token cost.
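The routing decision can be sketched as a simple classifier. The model names, keyword list, and token-estimate heuristic below are all illustrative assumptions, not the gateway's actual routing logic:

```python
CHEAP_MODEL = "cheap-model"       # placeholder for a Flash/Haiku-class model
PREMIUM_MODEL = "premium-model"   # placeholder for a top-tier reasoning model

# Hypothetical markers of a complex request (assumption for illustration).
COMPLEX_HINTS = ("explain why", "prove", "step by step", "analyze")

def route(request: str, max_cheap_tokens: int = 2000) -> str:
    # Long contexts and reasoning-style prompts go to the premium tier;
    # everything else is handled by the cheap tier.
    approx_tokens = len(request) // 4  # rough chars-per-token estimate
    if approx_tokens > max_cheap_tokens:
        return PREMIUM_MODEL
    if any(hint in request.lower() for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL
```

A production router might instead use a small classifier model, but the cost logic is the same: only pay top-tier prices when the request warrants it.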
Technique 3: context compression
Long documents are first summarized by a cheap model, and a compact digest with pinpoint citations of the relevant chunks is sent to the main model.
Result: 100k-token contexts shrink to 8–12k without quality loss.
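A sketch of the compression step, assuming a `summarize` callable that wraps the cheap model (stubbed here) and fixed-size character chunking; real chunking would follow document structure:

```python
from typing import Callable

def compress_context(document: str, summarize: Callable[[str], str],
                     chunk_size: int = 500) -> str:
    # Split the document into chunks, summarize each with a cheap model,
    # and tag every summary with its chunk id so the main model can cite
    # the exact source span.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    digest_lines = [f"[chunk {i}] {summarize(chunk)}"
                    for i, chunk in enumerate(chunks)]
    return "\n".join(digest_lines)
```

The digest replaces the raw document in the main model's prompt; only the cited chunks are fetched in full if the model asks for them.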
Technique 4: batching
Multiple independent requests are bundled into a single batch call, lowering per-request overhead.
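A minimal micro-batcher sketch. `send_batch` stands in for a provider's batch endpoint (an assumption here); the accumulate-then-flush pattern is the point:

```python
from typing import Callable

class MicroBatcher:
    """Accumulates independent requests and flushes them as one batch call."""

    def __init__(self, send_batch: Callable[[list[str]], list[str]],
                 batch_size: int = 8):
        self.send_batch = send_batch  # provider batch call (stubbed in tests)
        self.batch_size = batch_size
        self.pending: list[str] = []
        self.results: list[str] = []

    def submit(self, request: str) -> None:
        # Queue a request; flush automatically once the batch is full.
        self.pending.append(request)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # One API call covers every pending request.
        if self.pending:
            self.results.extend(self.send_batch(self.pending))
            self.pending = []
```

A production batcher would also flush on a timeout so a half-full batch never waits indefinitely.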
Real numbers
Across our customers' production loads, we measure 40–60% savings versus direct OpenAI/Anthropic integrations, with no quality compromises.