Scaling AI Companions to 100K Users: Lessons Learned
Hard-won lessons from scaling relational AI infrastructure—memory storage architecture, retrieval latency, personality consistency, and cost optimization.
The moment we crossed 10,000 daily active users, our infrastructure started screaming. Not in the dramatic way you might expect—no cascading failures or 3 AM pages. Instead, we noticed something more insidious: response latencies creeping up by 50ms here, memory retrieval becoming inconsistent there, and most concerning, users reporting that their AI companions felt "different" from one conversation to the next.
Building AI companions that remember, feel, and initiate is fundamentally different from building a stateless chatbot. When you scale AI applications with persistent memory and emotional continuity, you're not just scaling compute—you're scaling identity itself.
At Promitheus, we've spent the past eighteen months learning these lessons the hard way. Here's what we discovered scaling to 100,000 users.
The Fundamental Challenge: State at Scale
Traditional web applications scale horizontally because requests are largely independent. AI companion architecture breaks this assumption entirely. Each conversation requires memory context, personality consistency, emotional state, and temporal awareness.
Multiply these requirements by 100,000 users, each potentially having multiple companions, and you begin to understand the infrastructure challenge.
Lesson 1: Memory Storage Architecture
Our first architecture was embarrassingly naive: a single PostgreSQL database with a memories table. It worked beautifully until about 5,000 users; beyond that, query times started exceeding acceptable limits.
Per-User Data Isolation
The first critical decision was moving to per-user data isolation. Not just logical separation with user IDs, but physical separation using a sharded architecture.
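A routing layer along these lines is one way to realize physical separation: hash each user ID to a stable shard so all of a user's memories land in the same database. This is a minimal sketch, not the production implementation; the shard count and connection-string format are illustrative.

```python
import hashlib

# Hypothetical shard router: maps each user to one of N physical
# databases via a stable hash, so all of a user's memories live together.
NUM_SHARDS = 16

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    # md5 gives a hash that is stable across processes and restarts
    # (unlike Python's built-in hash(), which is salted per process).
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def dsn_for_user(user_id: str) -> str:
    # Each shard is its own database; the DSN format here is made up.
    return f"postgresql://memories-shard-{shard_for_user(user_id)}/companions"
```

Deterministic hashing also simplifies operations: any service can compute the shard locally without a lookup table, at the cost of more involved resharding when the shard count changes.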
Hot/Warm/Cold Storage Tiers
We implemented a three-tier storage system:
Hot tier (Redis): Last 7 days of conversations, frequently accessed memories, current emotional state. Sub-10ms retrieval.
Warm tier (DynamoDB + Pinecone): Last 90 days, all semantically significant memories. 50-100ms retrieval.
Cold tier (S3 + archived embeddings): Everything older. 200-500ms retrieval, accessed only when specifically relevant.
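The tier list above implies a lookup order: try hot, fall through to warm, then cold, promoting on access so the next read is fast. A toy sketch of that read path, with plain dicts standing in for the Redis, DynamoDB/Pinecone, and S3 clients:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative tiered lookup: plain dicts stand in for Redis (hot),
# DynamoDB + Pinecone (warm), and S3 (cold) clients.
@dataclass
class TieredMemoryStore:
    hot: dict = field(default_factory=dict)
    warm: dict = field(default_factory=dict)
    cold: dict = field(default_factory=dict)

    def get(self, key: str) -> Optional[str]:
        # Check tiers from cheapest to most expensive.
        for tier in (self.hot, self.warm, self.cold):
            if key in tier:
                value = tier[key]
                # Promote on access so the next read hits the hot tier.
                self.hot[key] = value
                return value
        return None
```

A real implementation would add TTL-based demotion out of the hot tier; the sketch only shows the read-through-and-promote direction.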
Cost Management
We implemented aggressive deduplication, compression for cold storage, and memory summarization—reducing storage by roughly 80% while maintaining semantic richness.
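Two of the cheapest wins named above, deduplication and cold-storage compression, can be sketched as follows. The fingerprinting normalization (strip + lowercase) is an assumption; a production system would likely use semantic near-duplicate detection rather than exact hashing.

```python
import hashlib
import zlib

# Hypothetical dedup helper: exact-duplicate detection via content hash.
def memory_fingerprint(text: str) -> str:
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def dedupe(memories: list[str]) -> list[str]:
    seen, unique = set(), []
    for memory in memories:
        fp = memory_fingerprint(memory)
        if fp not in seen:
            seen.add(fp)
            unique.append(memory)
    return unique

# Compression for cold-tier blobs before they go to S3.
def compress_for_cold(text: str) -> bytes:
    return zlib.compress(text.encode("utf-8"), level=9)
```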
Lesson 2: Retrieval Latency
Users perceive AI companions as slow when total response time exceeds about 3 seconds. With LLM inference taking 1-2 seconds, that leaves precious little time for memory retrieval.
We established a latency budget: 200ms for memory retrieval, 100ms for personality application, 100ms for context construction, leaving 2.5 seconds for LLM inference.
Parallel Operations
Instead of sequential operations, we parallelized aggressively—memory retrieval, personality lookup, and emotional state fetch all happen simultaneously.
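The fan-out can be sketched with `asyncio.gather`: the three fetches below are stand-ins for real service calls, and the sleeps simulate their latencies. Total wall time is roughly the slowest call rather than the sum of all three.

```python
import asyncio

# Stand-ins for the real service calls; sleeps simulate latency.
async def fetch_memories(user_id: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulate a 50ms vector-store query
    return ["met at a conference"]

async def fetch_personality(user_id: str) -> dict:
    await asyncio.sleep(0.03)
    return {"warmth": 0.8}

async def fetch_emotional_state(user_id: str) -> dict:
    await asyncio.sleep(0.02)
    return {"mood": "content"}

async def build_context(user_id: str) -> dict:
    # All three lookups run concurrently instead of back-to-back.
    memories, personality, state = await asyncio.gather(
        fetch_memories(user_id),
        fetch_personality(user_id),
        fetch_emotional_state(user_id),
    )
    return {"memories": memories, "personality": personality, "state": state}
```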
Caching Strategies
Session cache: Current conversation context, refreshed every message. No retrieval needed for continuing conversations.
User profile cache: Core memories, personality parameters, relationship summary. Refreshed every 15 minutes.
Embedding cache: Recent query embeddings to avoid recomputation.
The session cache alone eliminated 60% of our retrieval operations.
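A user-profile cache with a 15-minute refresh, as described above, is essentially a TTL cache with read-through loading. A minimal sketch (the TTL value matches the text; everything else is illustrative):

```python
import time

# Minimal read-through TTL cache: entries expire after ttl_seconds
# and are re-fetched via the supplied loader on the next access.
class TTLCache:
    def __init__(self, ttl_seconds: float = 900.0):  # 15 minutes
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get_or_load(self, key: str, loader):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh, no backend hit
        value = loader(key)
        self._store[key] = (now, value)
        return value
```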
Pre-computation
We implemented background jobs that run after each conversation, pre-computing likely next retrievals. When a user returns, their most probable memory needs are already in hot cache.
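The shape of such a warm-up job might look like the sketch below. The prediction heuristic here (warm memories tied to the most recent conversation topics) is deliberately simple and purely illustrative; the real signal could be anything from recency to a learned model.

```python
# Hypothetical post-conversation warm-up job: after a session ends,
# push the memories the user is likely to need next into the hot cache.
def precompute_hot_cache(conversation_topics: list[str],
                         memory_index: dict[str, list[str]],
                         hot_cache: dict) -> int:
    warmed = 0
    # Most recent topics first: they are the likeliest to recur.
    for topic in reversed(conversation_topics):
        for memory in memory_index.get(topic, []):
            if memory not in hot_cache:
                hot_cache[memory] = True
                warmed += 1
    return warmed
```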
Lesson 3: Personality Consistency
Nothing breaks the illusion faster than a companion whose personality shifts between conversations.
Maintaining Character
Personality in LLM-based companions is fundamentally a prompt engineering challenge at scale. We represent personality as a structured document with immutable core traits, mutable style parameters, and conversation-specific adaptations.
Prompt Management at Scale
We treat personality prompts as code, with version control, testing, and staged rollouts.
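One plausible shape for a versioned prompt record with staged rollout: core traits frozen per version, style parameters mutable, and deterministic user bucketing so the same user always sees the same version. The field names and rollout mechanics are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical versioned-prompt record: core traits are immutable per
# version; rollout_pct drives a staged release (e.g. 5% before 100%).
@dataclass(frozen=True)
class PersonalityPromptVersion:
    version: str
    core_traits: tuple   # immutable by construction
    style: dict          # mutable parameters, varied per version
    rollout_pct: int     # 0-100

def user_bucket(user_id: str) -> int:
    # Deterministic 0-99 bucket: same user, same bucket, every time.
    return int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100

def pick_version(user_id: str,
                 stable: PersonalityPromptVersion,
                 candidate: PersonalityPromptVersion) -> PersonalityPromptVersion:
    return candidate if user_bucket(user_id) < candidate.rollout_pct else stable
```

Deterministic bucketing matters more here than in a typical A/B test: flipping a user between prompt versions mid-relationship is exactly the personality inconsistency this section warns about.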
Lesson 4: Emotional State Management
Efficient State Persistence
Emotional state is surprisingly compact: we represent it as a vector of ~20 dimensions plus a small set of active emotional markers.
This state persists in Redis for active users, serializes to DynamoDB between sessions, and updates after every conversation turn.
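A state that compact serializes trivially. A sketch of the record and its round-trip, assuming a 20-dimension vector as described (the JSON encoding and field names are illustrative, not the actual wire format):

```python
import json

# Illustrative emotional-state record: a compact vector plus a small
# set of active markers, small enough for a single DynamoDB item.
DIMENSIONS = 20

def new_state() -> dict:
    return {"vector": [0.0] * DIMENSIONS, "markers": []}

def serialize(state: dict) -> str:
    # Compact separators keep the serialized item small.
    return json.dumps(state, separators=(",", ":"))

def deserialize(blob: str) -> dict:
    state = json.loads(blob)
    if len(state["vector"]) != DIMENSIONS:
        raise ValueError("unexpected emotional-state dimensionality")
    return state
```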
Batch Processing for Background Updates
Emotional state shouldn't just reflect conversations—it should evolve with time. We run hourly batch jobs that update emotional states for all users based on time-based rules.
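A common shape for such a time-based rule is exponential decay toward a per-user baseline, applied per dimension by the hourly job. This is a sketch of one plausible rule, not the actual update logic; the half-life value is made up.

```python
# Hypothetical time-decay rule: between conversations, each emotional
# dimension relaxes exponentially toward the user's baseline.
def decay_toward_baseline(value: float, baseline: float,
                          hours_elapsed: float,
                          half_life_hours: float = 24.0) -> float:
    # After one half-life, the deviation from baseline is halved.
    factor = 0.5 ** (hours_elapsed / half_life_hours)
    return baseline + (value - baseline) * factor
```

Running this hourly for every user is cheap precisely because the state is so compact: one small vector update per user per hour, batched.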
Lesson 5: Cost Optimization
Production AI at scale is expensive. At 100,000 users, unoptimized token usage can easily exceed $100,000/month in LLM costs alone.
Token Usage Management
We implemented strict token budgets per conversation turn. Memory context gets summarized. Personality prompts use efficient representations.
Key insight: users can't tell the difference between 8,000 tokens of context and 4,000 tokens of well-curated context, but your costs differ by 2x.
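Enforcing a per-turn budget might look like the sketch below: allocate a fixed budget across context sections and trim the least-essential sections first when over. Token counts are approximated by word count here purely for illustration; a real system would use the model's tokenizer, and the trim order is an assumption.

```python
# Hypothetical per-turn token budget: trim the memory section first,
# then conversation history, leaving the personality prompt intact.
def apply_token_budget(sections: dict[str, str], budget: int,
                       trim_order: tuple = ("memories", "history")) -> dict[str, str]:
    def count(text: str) -> int:
        return len(text.split())  # crude proxy for a real tokenizer

    total = sum(count(v) for v in sections.values())
    trimmed = dict(sections)
    for name in trim_order:
        if total <= budget:
            break
        words = trimmed.get(name, "").split()
        overshoot = total - budget
        keep = max(0, len(words) - overshoot)
        trimmed[name] = " ".join(words[:keep])
        total -= len(words) - keep
    return trimmed
```

Dropping the oldest tokens wholesale is the crudest option; the summarization mentioned above is the better production move, since it shrinks the section while preserving meaning.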
Tiered Service Levels
Different users have different needs. We implemented service tiers that adjust memory retrieval depth, context richness, background processing frequency, and response length limits.
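As a config, the four knobs named above might be expressed like this. The tier names and every number are illustrative, not Promitheus's actual pricing tiers.

```python
from dataclasses import dataclass

# Illustrative tier table covering the knobs from the text: retrieval
# depth, context size, background-job frequency, and response length.
@dataclass(frozen=True)
class ServiceTier:
    memory_top_k: int            # how many memories to retrieve per turn
    context_tokens: int          # context budget per turn
    background_jobs_per_day: int # pre-computation / decay frequency
    max_response_tokens: int     # response length cap

TIERS = {
    "free": ServiceTier(memory_top_k=5, context_tokens=2000,
                        background_jobs_per_day=1, max_response_tokens=300),
    "plus": ServiceTier(memory_top_k=15, context_tokens=4000,
                        background_jobs_per_day=6, max_response_tokens=600),
    "pro":  ServiceTier(memory_top_k=40, context_tokens=8000,
                        background_jobs_per_day=24, max_response_tokens=1200),
}
```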
Lesson 6: Monitoring and Observability
Tracking Relationship Quality Metrics
We developed composite metrics that combine multiple per-user signals into a single relationship-quality score, tracked over time.
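One plausible shape for such a composite is a weighted blend of normalized signals. The signal names and weights below are entirely hypothetical, chosen only to illustrate the structure, not the metrics Promitheus actually tracks.

```python
# Hypothetical composite relationship-quality score: a weighted blend
# of normalized signals, each in [0, 1]. Names and weights are made up.
WEIGHTS = {
    "return_rate": 0.4,       # did the user come back this period?
    "memory_recall_ok": 0.3,  # fraction of retrieval spot-checks passed
    "sentiment_trend": 0.3,   # direction of conversation sentiment
}

def relationship_quality(signals: dict[str, float]) -> float:
    if set(signals) != set(WEIGHTS):
        raise ValueError("missing or unexpected signal")
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    return round(score, 4)
```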
Detecting Degraded Experiences
We alert on metric degradation at the user level, not just aggregate. If a specific user's experience is declining, we can investigate before they churn.
Build vs. Outsource
Here's our honest advice:
Build yourself: Application-specific personality design, user experience, conversation flows, domain-specific memory importance scoring.
Outsource: Memory infrastructure, retrieval optimization, emotional state management, scaling concerns. Unless AI infrastructure is your core competency, these are solved problems.
The companies we see succeed focus relentlessly on what makes their companions unique, while leveraging infrastructure that handles the hard scaling problems.
Scaling AI companions is hard. But it's a solvable kind of hard—one that yields to careful architecture, obsessive measurement, and learning from others' mistakes.
About the Author
Promitheus Team
Engineering
The team building Promitheus—engineers, researchers, and designers passionate about relational AI.