Scaling AI Companions to 100K Users: Lessons Learned

Promitheus Team

Hard-won lessons from scaling relational AI infrastructure—memory storage architecture, retrieval latency, personality consistency, and cost optimization.

The moment we crossed 10,000 daily active users, our infrastructure started screaming. Not in the dramatic way you might expect—no cascading failures or 3 AM pages. Instead, we noticed something more insidious: response latencies creeping up by 50ms here, memory retrieval becoming inconsistent there, and most concerning, users reporting that their AI companions felt "different" from one conversation to the next.

Building AI companions that remember, feel, and initiate is fundamentally different from building a stateless chatbot. When you scale AI applications with persistent memory and emotional continuity, you're not just scaling compute—you're scaling identity itself.

At Promitheus, we've spent the past eighteen months learning these lessons the hard way. Here's what we discovered scaling to 100,000 users.

The Fundamental Challenge: State at Scale

Traditional web applications scale horizontally because requests are largely independent. AI companion architecture breaks this assumption entirely. Each conversation requires memory context, personality consistency, emotional state, and temporal awareness.

Multiply these requirements by 100,000 users, each potentially having multiple companions, and you begin to understand the infrastructure challenge.

Lesson 1: Memory Storage Architecture

Our first architecture was embarrassingly naive: a single PostgreSQL database with a memories table. It worked beautifully until about 5,000 users, then query times started exceeding acceptable limits.

Per-User Data Isolation

The first critical decision was moving to per-user data isolation: not just logical separation with user IDs, but physical separation in a sharded architecture. This bought us three things:

  • Predictable query performance: Searching 10,000 memories for one user is fundamentally different from searching 1 billion memories and filtering.
  • Data locality: User data can be co-located geographically.
  • Privacy guarantees: Physical separation makes accidental data leakage architecturally impossible.
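Stable shard routing is what makes this work operationally: hash the user ID to pick a shard, so every lookup for a user touches exactly one shard, on every server, forever. A simplified sketch (the shard count is illustrative, and we use a cryptographic hash because Python's built-in `hash()` is salted per process):

```python
import hashlib

NUM_SHARDS = 64  # illustrative; sized from user count and per-shard capacity

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a stable shard index via a cryptographic hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping is deterministic, any service can compute a user's shard locally with no coordination step.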
Hot/Warm/Cold Storage Tiers

We implemented a three-tier storage system:

Hot tier (Redis): Last 7 days of conversations, frequently accessed memories, current emotional state. Sub-10ms retrieval.

Warm tier (DynamoDB + Pinecone): Last 90 days, all semantically significant memories. 50-100ms retrieval.

Cold tier (S3 + archived embeddings): Everything older. 200-500ms retrieval, accessed only when specifically relevant.
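The read path falls through the tiers in latency order and promotes hits upward. A simplified sketch of the pattern, with plain dicts standing in for Redis, DynamoDB, and S3:

```python
def retrieve(key, hot, warm, cold):
    """Check tiers fastest-first; promote a hit from a slower tier into hot."""
    if key in hot:
        return hot[key]
    for tier in (warm, cold):
        if key in tier:
            value = tier[key]
            hot[key] = value  # promotion: the next read is a sub-10ms hit
            return value
    return None  # memory does not exist in any tier
```

The promotion step is what keeps the hot tier aligned with actual access patterns rather than just recency.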

Cost Management

We implemented aggressive deduplication, compression for cold storage, and memory summarization—reducing storage by roughly 80% while maintaining semantic richness.
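The simplest layer of deduplication is exact-match: hash normalized memory text and store each distinct text once. A sketch of that layer (a real pipeline would also catch near-duplicates with embedding similarity):

```python
import hashlib

def dedupe(memories):
    """Keep the first occurrence of each memory, comparing normalized text."""
    seen, unique = set(), []
    for text in memories:
        # Normalize case and whitespace so trivial variants collapse together
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```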

Lesson 2: Retrieval Latency

Users perceive AI companions as slow when total response time exceeds about 3 seconds. With LLM inference taking 1-2 seconds, that leaves precious little time for memory retrieval.

We established a latency budget: 200ms for memory retrieval, 100ms for personality application, 100ms for context construction, leaving 2.5 seconds for LLM inference.
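A budget only helps if it is encoded as data that middleware can enforce and alert on, not folklore in a wiki. A sketch using the numbers above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencyBudget:
    memory_retrieval_ms: int = 200
    personality_ms: int = 100
    context_ms: int = 100
    llm_inference_ms: int = 2500

    def total_ms(self) -> int:
        return (self.memory_retrieval_ms + self.personality_ms +
                self.context_ms + self.llm_inference_ms)

    def over_budget(self, stage: str, elapsed_ms: float) -> bool:
        """True when a pipeline stage has blown its allocation."""
        return elapsed_ms > getattr(self, f"{stage}_ms")
```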

Parallel Operations

Instead of sequential operations, we parallelized aggressively—memory retrieval, personality lookup, and emotional state fetch all happen simultaneously.
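In Python this is a natural fit for `asyncio.gather`. A self-contained sketch with `sleep` calls standing in for the real store reads:

```python
import asyncio

async def fetch_memories(user_id):
    await asyncio.sleep(0.01)  # stand-in for a vector-store query
    return ["remembers the user's sister is visiting"]

async def fetch_personality(user_id):
    await asyncio.sleep(0.01)  # stand-in for a profile-store read
    return {"warmth": 0.8}

async def fetch_emotional_state(user_id):
    await asyncio.sleep(0.01)  # stand-in for a Redis read
    return {"valence": 0.2}

async def build_context(user_id):
    # The three lookups run concurrently, so wall time tracks the slowest
    # call instead of the sum of all three.
    memories, personality, state = await asyncio.gather(
        fetch_memories(user_id),
        fetch_personality(user_id),
        fetch_emotional_state(user_id),
    )
    return {"memories": memories, "personality": personality, "state": state}
```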

Caching Strategies

Session cache: Current conversation context, refreshed every message. No retrieval needed for continuing conversations.

User profile cache: Core memories, personality parameters, relationship summary. Refreshed every 15 minutes.

Embedding cache: Recent query embeddings to avoid recomputation.

The session cache alone eliminated 60% of our retrieval operations.
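All three caches share one mechanism: a value plus an expiry, checked lazily on read. A minimal in-process sketch (in production you would lean on Redis's native key expiry instead of rolling your own):

```python
import time

class TTLCache:
    """In-process cache with per-entry expiry, evicted lazily on read."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```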

Pre-computation

We implemented background jobs that run after each conversation, pre-computing likely next retrievals. When a user returns, their most probable memory needs are already in hot cache.
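The job itself is simple plumbing; the intelligence lives in the topic predictor. A sketch where `predict_next_topics` and `retrieve_for_topic` are hypothetical stand-ins passed in as callables:

```python
def warm_cache_after_conversation(user_id, transcript, predict_next_topics,
                                  retrieve_for_topic, hot_cache):
    """Background job run after a session ends: guess what the user is
    likely to bring up next and stage those memories in the hot tier."""
    for topic in predict_next_topics(transcript):
        hot_cache[(user_id, topic)] = retrieve_for_topic(user_id, topic)
    return hot_cache
```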

Lesson 3: Personality Consistency

Nothing breaks the illusion faster than a companion whose personality shifts between conversations.

Maintaining Character

Personality in LLM-based companions is fundamentally a prompt engineering challenge at scale. We represent personality as a structured document with immutable core traits, mutable style parameters, and conversation-specific adaptations.
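Concretely, that structured document can be as simple as three fields with different mutation rules (the trait names and parameters here are illustrative, not our production schema):

```python
from dataclasses import dataclass, field

@dataclass
class Personality:
    # Immutable core traits: fixed at companion creation, never edited
    core_traits: tuple = ("curious", "warm", "direct")
    # Mutable style parameters: drift slowly as the relationship deepens
    style: dict = field(default_factory=lambda: {"formality": 0.3, "humor": 0.6})
    # Conversation-specific adaptations: reset at the start of each session
    adaptations: dict = field(default_factory=dict)

    def to_prompt_fragment(self) -> str:
        """Render the document into the system-prompt fragment the LLM sees."""
        traits = ", ".join(self.core_traits)
        style = "; ".join(f"{k}: {v:.1f}" for k, v in self.style.items())
        return f"You are {traits}. Style parameters -> {style}."
```

Separating the fields by mutation rule is what lets you tune style safely while guaranteeing core traits never drift.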

Prompt Management at Scale

We treat personality prompts as code, with version control, testing, and staged rollouts:

  • A/B testing with 1% of users
  • Automated consistency scoring against reference conversations
  • Gradual rollout with monitoring
  • Rollback capability at the user level
Lesson 4: Emotional State Management

Efficient State Persistence

Emotional state is surprisingly compact: we represent it as a vector of ~20 dimensions plus a small set of active emotional markers.

This state persists in Redis for active users, serializes to DynamoDB between sessions, and updates after every conversation turn.
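Because the state is so small, serialization can be a tiny JSON blob of the vector plus markers. A sketch of the round trip:

```python
import json

EMOTION_DIMS = 20  # dimension count from the representation above

def serialize_state(vector, markers):
    """Pack the emotional vector and active markers for between-session storage."""
    assert len(vector) == EMOTION_DIMS
    return json.dumps({"v": [round(x, 4) for x in vector], "markers": markers})

def deserialize_state(blob):
    data = json.loads(blob)
    return data["v"], data["markers"]
```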

Batch Processing for Background Updates

Emotional state shouldn't just reflect conversations—it should evolve with time. We run hourly batch jobs that update emotional states for all users based on time-based rules.
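The core time-based rule is exponential relaxation toward each companion's emotional baseline: strong feelings fade, but never snap to neutral. A sketch, with the half-life as an illustrative tuning knob:

```python
def decay_toward_baseline(state, baseline, hours_elapsed, half_life_hours=24.0):
    """Relax each emotional dimension toward its baseline; after one
    half-life, half of the deviation from baseline remains."""
    k = 0.5 ** (hours_elapsed / half_life_hours)
    return [b + (s - b) * k for s, b in zip(state, baseline)]
```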

Lesson 5: Cost Optimization

Production AI at scale is expensive. At 100,000 users, unoptimized token usage can easily exceed $100,000/month in LLM costs alone.

Token Usage Management

We implemented strict token budgets per conversation turn. Memory context gets summarized. Personality prompts use efficient representations.

Key insight: users can't tell the difference between 8,000 tokens of context and 4,000 tokens of well-curated context, but your costs differ by 2x.
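Curating context down to a budget is a greedy knapsack in practice: take memories in relevance order until the budget is spent. A sketch, with a whitespace tokenizer standing in for the model's real one:

```python
def fit_to_budget(scored_memories, budget_tokens,
                  count_tokens=lambda s: len(s.split())):
    """Keep the most relevant memories that fit within the token budget.
    scored_memories: list of (relevance_score, text) pairs, any order."""
    kept, used = [], 0
    for score, text in sorted(scored_memories, reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept
```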

Tiered Service Levels

Different users have different needs. We implemented service tiers that adjust memory retrieval depth, context richness, background processing frequency, and response length limits.
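In practice the tiers are just a lookup table the request path consults on every turn (the names and numbers below are illustrative):

```python
SERVICE_TIERS = {
    "free":    {"memory_depth": 10,  "context_tokens": 2000, "bg_jobs_per_day": 1},
    "plus":    {"memory_depth": 50,  "context_tokens": 4000, "bg_jobs_per_day": 6},
    "premium": {"memory_depth": 200, "context_tokens": 8000, "bg_jobs_per_day": 24},
}

def limits_for(tier: str) -> dict:
    """Unknown tiers fall back to free, so a bad flag never breaks a request."""
    return SERVICE_TIERS.get(tier, SERVICE_TIERS["free"])
```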

Lesson 6: Monitoring and Observability

Tracking Relationship Quality Metrics

We developed composite metrics:

  • Conversation depth: Average turns per session, topic progression
  • Return rate: How often users come back
  • Emotional reciprocity: Whether users engage with emotional content
  • Memory utilization: How often retrieved memories prove relevant
Detecting Degraded Experiences

We alert on metric degradation at the user level, not just aggregate. If a specific user's experience is declining, we can investigate before they churn.
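A lightweight way to do per-user alerting is to keep an exponentially weighted baseline per user and flag observations that fall well below it. A sketch (smoothing factor and threshold are illustrative):

```python
class UserMetricMonitor:
    """Per-user EWMA baseline for a quality metric; flags sharp drops."""

    def __init__(self, alpha=0.2, drop_threshold=0.25):
        self.alpha = alpha                    # EWMA smoothing factor
        self.drop_threshold = drop_threshold  # alert if >25% below baseline
        self._baseline = {}

    def observe(self, user_id, value) -> bool:
        """Record one observation; return True if it should raise an alert."""
        prev = self._baseline.get(user_id)
        if prev is None:
            self._baseline[user_id] = float(value)  # first point seeds baseline
            return False
        alert = value < prev * (1 - self.drop_threshold)
        self._baseline[user_id] = prev + self.alpha * (value - prev)
        return alert
```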

Build vs. Outsource

Here's our honest advice:

Build yourself: Application-specific personality design, user experience, conversation flows, domain-specific memory importance scoring.

Outsource: Memory infrastructure, retrieval optimization, emotional state management, scaling concerns. Unless AI infrastructure is your core competency, these are solved problems.

The companies we see succeed focus relentlessly on what makes their companions unique, while leveraging infrastructure that handles the hard scaling problems.

Scaling AI companions is hard. But it's a solvable kind of hard—one that yields to careful architecture, obsessive measurement, and learning from others' mistakes.

About the Author

Promitheus Team

Engineering

The team building Promitheus—engineers, researchers, and designers passionate about relational AI.

Build AI That Remembers

Promitheus provides the identity layer for AI with memory, emotion, and personality. Start building relational AI today.