What is Latency?
Latency in AI refers to the time between sending a request and receiving a response. Low latency (fast responses) is crucial for interactive applications, while high latency degrades user experience. Latency depends on model size, infrastructure, and request complexity.
What is Latency?
Latency is the delay from request to response in AI systems. It includes: network time (request traveling to server), queue time (waiting for processing capacity), computation time (model generating output), and response transmission. For AI, latency is typically measured to first token (when streaming begins) and total completion. Latency varies widely: small models on fast infrastructure might respond in 100ms; large models with complex prompts might take 10+ seconds. Low latency enables conversational AI; high latency relegates AI to batch processing.
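The two measurements above, time-to-first-token and total completion time, can be captured with a simple timer around a streaming response. This is a minimal sketch using a simulated streaming generator (`fake_stream` and its delay parameters are hypothetical stand-ins for a real model client):

```python
import time

def fake_stream(n_tokens=5, ttft_s=0.05, per_token_s=0.01):
    """Simulated streaming model: waits, then yields tokens one at a time."""
    time.sleep(ttft_s)            # delay before the first token appears
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)   # per-token generation delay

def measure_latency(stream):
    """Return (time_to_first_token, total_time) in seconds for a token stream."""
    start = time.monotonic()
    first = None
    for _ in stream:
        if first is None:
            first = time.monotonic() - start  # first token arrived
    total = time.monotonic() - start          # stream fully consumed
    return first, total

ttft, total = measure_latency(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

The same `measure_latency` wrapper works with any iterator of tokens, so it can be pointed at a real streaming API response instead of the simulator.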
How Latency Works
AI latency has several components. Time-to-first-token (TTFT) is primarily model loading, prompt processing, and generating the first output token—this determines how quickly users see responses start. Token generation latency is per-token computation, which determines how quickly streaming responses flow. Factors affecting latency: model size (larger = slower), prompt length (longer context = more processing), output length, hardware (GPUs, TPUs, specialized chips), infrastructure (dedicated vs. shared capacity), and geographic distance to servers. Providers offer different latency profiles—some optimize for speed, others for cost.
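The decomposition above suggests a back-of-the-envelope latency model: TTFT is roughly network overhead plus prompt processing, and total latency adds per-token decoding time. This sketch uses illustrative throughput numbers (the parameter defaults are assumptions, not measurements of any particular provider):

```python
def estimate_latency_ms(prompt_tokens, output_tokens,
                        prefill_tok_per_s=2000.0,   # prompt-processing throughput (assumed)
                        decode_tok_per_s=50.0,      # token-generation throughput (assumed)
                        network_overhead_ms=80.0):  # round-trip + queue time (assumed)
    """Rough model: TTFT = network + prefill; total = TTFT + decode time."""
    ttft_ms = network_overhead_ms + prompt_tokens / prefill_tok_per_s * 1000
    total_ms = ttft_ms + output_tokens / decode_tok_per_s * 1000
    return ttft_ms, total_ms

ttft, total = estimate_latency_ms(prompt_tokens=1000, output_tokens=200)
print(f"TTFT ~{ttft:.0f} ms, total ~{total:.0f} ms")
# With these assumed rates: TTFT ~580 ms, total ~4580 ms
```

The model makes the trade-offs in the paragraph concrete: doubling the prompt length mainly raises TTFT, while doubling the output length stretches total completion time.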
Why Latency Matters
Latency determines user experience. Research shows users abandon interactions with delays over 2-3 seconds. For coding assistants, slow suggestions interrupt flow. For chat, latency breaks conversational rhythm. For real-time applications (voice assistants, live translation), latency is critical. Understanding latency helps in: choosing models (faster vs. more capable), designing UX (streaming, loading indicators), sizing infrastructure, and setting user expectations.
Examples of Latency
A fast coding autocomplete needs <200ms latency to feel instant. A chat assistant feels responsive with <1s time-to-first-token. Voice assistants need end-to-end latency under 300ms for natural conversation. Batch processing tasks (document analysis, bulk classification) can tolerate higher latency for better quality or lower cost. Real-time game NPCs need very low latency to feel responsive.
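The budgets in these examples can be encoded as a simple lookup so an application can flag requests that miss their target. This is an illustrative sketch; the use-case names and thresholds just restate the figures above:

```python
# Latency budgets (ms) from the examples above; batch work has no hard budget.
BUDGETS_MS = {
    "autocomplete": 200,       # should feel instant
    "chat_ttft": 1000,         # time-to-first-token for a responsive chat
    "voice_end_to_end": 300,   # natural conversational turn-taking
}

def within_budget(use_case: str, measured_ms: float) -> bool:
    """True if a measured latency meets the target for the given use case."""
    return measured_ms <= BUDGETS_MS[use_case]

print(within_budget("autocomplete", 150))       # fast enough
print(within_budget("voice_end_to_end", 400))   # over budget
```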
Common Misconceptions
Latency isn't just about model speed—network, infrastructure, and queue times often dominate. Another misconception is that latency is fixed; it varies by load, time of day, and request specifics. Streaming doesn't reduce total latency but reduces perceived latency by showing progress. Lower latency often trades off against quality or cost—faster models may be smaller or less capable.
Key Takeaways
- Latency is the delay between sending a request to an AI system and receiving a response, commonly measured as time-to-first-token and total completion time.
- Understanding latency is essential for developers choosing models, designing responsive UX, and sizing infrastructure for interactive AI applications.
- Promitheus provides infrastructure for building low-latency AI applications in production.
Written by the Promitheus Team
Part of the AI Glossary · 50 terms
Build Low-Latency AI
Promitheus provides the infrastructure to build low-latency AI applications.