Scaling AI Companions to 100K Users: Lessons Learned

Promitheus Team

Hard-won lessons from scaling relational AI infrastructure—memory storage architecture, retrieval latency, personality consistency, and cost optimization.

The moment we crossed 10,000 daily active users, our infrastructure started screaming. Not in the dramatic way you might expect—no cascading failures or 3 AM pages. Instead, we noticed something more insidious: response latencies creeping up by 50ms here, memory retrieval becoming inconsistent there, and most concerning, users reporting that their AI companions felt "different" from one conversation to the next.

Building AI companions that remember, feel, and initiate is fundamentally different from building a stateless chatbot. When you scale AI applications with persistent memory and emotional continuity, you're not just scaling compute—you're scaling identity itself.

At Promitheus, we've spent the past eighteen months learning these lessons the hard way. Here's what we discovered scaling to 100,000 users.

The Fundamental Challenge: State at Scale

Traditional web applications scale horizontally because requests are largely independent. AI companion architecture breaks this assumption entirely. Each conversation requires memory context, personality consistency, emotional state, and temporal awareness.

Multiply these requirements by 100,000 users, each potentially having multiple companions, and you begin to understand the infrastructure challenge.

Lesson 1: Memory Storage Architecture

Our first architecture was embarrassingly naive: a single PostgreSQL database with a memories table. It worked beautifully until about 5,000 users, then query times started exceeding acceptable limits.

Per-User Data Isolation

The first critical decision was moving to per-user data isolation: not just logical separation with user IDs, but physical separation in a sharded architecture. This bought us three things:

  • Predictable query performance: Searching 10,000 memories for one user is fundamentally different from searching 1 billion memories and filtering.
  • Data locality: User data can be co-located geographically.
  • Privacy guarantees: Physical separation makes accidental data leakage architecturally impossible.
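Stable shard routing is what makes this work operationally: hash the user ID to pick a shard, so every lookup for a user touches exactly one shard, on every server, forever. A simplified sketch (the shard count is illustrative, and we use a cryptographic hash because Python's built-in `hash()` is salted per process):

```python
import hashlib

NUM_SHARDS = 64  # illustrative; sized from user count and per-shard capacity

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a stable shard index via a cryptographic hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping is deterministic, any service can compute a user's shard locally with no coordination step.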
Hot/Warm/Cold Storage Tiers

We implemented a three-tier storage system:

Hot tier (Redis): Last 7 days of conversations, frequently accessed memories, current emotional state. Sub-10ms retrieval.

Warm tier (DynamoDB + Pinecone): Last 90 days, all semantically significant memories. 50-100ms retrieval.

Cold tier (S3 + archived embeddings): Everything older. 200-500ms retrieval, accessed only when specifically relevant.
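The read path falls through the tiers in latency order and promotes hits upward. A simplified sketch of the pattern, with plain dicts standing in for Redis, DynamoDB, and S3:

```python
def retrieve(key, hot, warm, cold):
    """Check tiers fastest-first; promote a hit from a slower tier into hot."""
    if key in hot:
        return hot[key]
    for tier in (warm, cold):
        if key in tier:
            value = tier[key]
            hot[key] = value  # promotion: the next read is a sub-10ms hit
            return value
    return None  # memory does not exist in any tier
```

The promotion step is what keeps the hot tier aligned with actual access patterns rather than just recency.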

Cost Management

We implemented aggressive deduplication, compression for cold storage, and memory summarization—reducing storage by roughly 80% while maintaining semantic richness.
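The simplest layer of deduplication is exact-match: hash normalized memory text and store each distinct text once. A sketch of that layer (a real pipeline would also catch near-duplicates with embedding similarity):

```python
import hashlib

def dedupe(memories):
    """Keep the first occurrence of each memory, comparing normalized text."""
    seen, unique = set(), []
    for text in memories:
        # Normalize case and whitespace so trivial variants collapse together
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```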

Lesson 2: Retrieval Latency

Users perceive AI companions as slow when total response time exceeds about 3 seconds. With LLM inference taking 1-2 seconds, that leaves precious little time for memory retrieval.

We established a latency budget: 200ms for memory retrieval, 100ms for personality application, 100ms for context construction, leaving 2.5 seconds for LLM inference.
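A budget only helps if it is encoded as data that middleware can enforce and alert on, not folklore in a wiki. A sketch using the numbers above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencyBudget:
    memory_retrieval_ms: int = 200
    personality_ms: int = 100
    context_ms: int = 100
    llm_inference_ms: int = 2500

    def total_ms(self) -> int:
        return (self.memory_retrieval_ms + self.personality_ms +
                self.context_ms + self.llm_inference_ms)

    def over_budget(self, stage: str, elapsed_ms: float) -> bool:
        """True when a pipeline stage has blown its allocation."""
        return elapsed_ms > getattr(self, f"{stage}_ms")
```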

Parallel Operations

Instead of sequential operations, we parallelized aggressively—memory retrieval, personality lookup, and emotional state fetch all happen simultaneously.
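In Python this is a natural fit for `asyncio.gather`. A self-contained sketch with `sleep` calls standing in for the real store reads:

```python
import asyncio

async def fetch_memories(user_id):
    await asyncio.sleep(0.01)  # stand-in for a vector-store query
    return ["remembers the user's sister is visiting"]

async def fetch_personality(user_id):
    await asyncio.sleep(0.01)  # stand-in for a profile-store read
    return {"warmth": 0.8}

async def fetch_emotional_state(user_id):
    await asyncio.sleep(0.01)  # stand-in for a Redis read
    return {"valence": 0.2}

async def build_context(user_id):
    # The three lookups run concurrently, so wall time tracks the slowest
    # call instead of the sum of all three.
    memories, personality, state = await asyncio.gather(
        fetch_memories(user_id),
        fetch_personality(user_id),
        fetch_emotional_state(user_id),
    )
    return {"memories": memories, "personality": personality, "state": state}
```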

Caching Strategies

Session cache: Current conversation context, refreshed every message. No retrieval needed for continuing conversations.

User profile cache: Core memories, personality parameters, relationship summary. Refreshed every 15 minutes.

Embedding cache: Recent query embeddings to avoid recomputation.

The session cache alone eliminated 60% of our retrieval operations.
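All three caches share one mechanism: a value plus an expiry, checked lazily on read. A minimal in-process sketch (in production you would lean on Redis's native key expiry instead of rolling your own):

```python
import time

class TTLCache:
    """In-process cache with per-entry expiry, evicted lazily on read."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```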

Pre-computation

We implemented background jobs that run after each conversation, pre-computing likely next retrievals. When a user returns, their most probable memory needs are already in hot cache.
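The job itself is simple plumbing; the intelligence lives in the topic predictor. A sketch where `predict_next_topics` and `retrieve_for_topic` are hypothetical stand-ins passed in as callables:

```python
def warm_cache_after_conversation(user_id, transcript, predict_next_topics,
                                  retrieve_for_topic, hot_cache):
    """Background job run after a session ends: guess what the user is
    likely to bring up next and stage those memories in the hot tier."""
    for topic in predict_next_topics(transcript):
        hot_cache[(user_id, topic)] = retrieve_for_topic(user_id, topic)
    return hot_cache
```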

Lesson 3: Personality Consistency

Nothing breaks the illusion faster than a companion whose personality shifts between conversations.

Maintaining Character

Personality in LLM-based companions is fundamentally a prompt engineering challenge at scale. We represent personality as a structured document with immutable core traits, mutable style parameters, and conversation-specific adaptations.
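Concretely, that structured document can be as simple as three fields with different mutation rules (the trait names and parameters here are illustrative, not our production schema):

```python
from dataclasses import dataclass, field

@dataclass
class Personality:
    # Immutable core traits: fixed at companion creation, never edited
    core_traits: tuple = ("curious", "warm", "direct")
    # Mutable style parameters: drift slowly as the relationship deepens
    style: dict = field(default_factory=lambda: {"formality": 0.3, "humor": 0.6})
    # Conversation-specific adaptations: reset at the start of each session
    adaptations: dict = field(default_factory=dict)

    def to_prompt_fragment(self) -> str:
        """Render the document into the system-prompt fragment the LLM sees."""
        traits = ", ".join(self.core_traits)
        style = "; ".join(f"{k}: {v:.1f}" for k, v in self.style.items())
        return f"You are {traits}. Style parameters -> {style}."
```

Separating the fields by mutation rule is what lets you tune style safely while guaranteeing core traits never drift.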

Prompt Management at Scale

We treat personality prompts as code, with version control, testing, and staged rollouts:

  • A/B testing with 1% of users
  • Automated consistency scoring against reference conversations
  • Gradual rollout with monitoring
  • Rollback capability at the user level
Lesson 4: Emotional State Management

Efficient State Persistence

Emotional state is surprisingly compact: we represent it as a vector of ~20 dimensions plus a small set of active emotional markers.

This state persists in Redis for active users, serializes to DynamoDB between sessions, and updates after every conversation turn.
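Because the state is so small, serialization can be a tiny JSON blob of the vector plus markers. A sketch of the round trip:

```python
import json

EMOTION_DIMS = 20  # dimension count from the representation above

def serialize_state(vector, markers):
    """Pack the emotional vector and active markers for between-session storage."""
    assert len(vector) == EMOTION_DIMS
    return json.dumps({"v": [round(x, 4) for x in vector], "markers": markers})

def deserialize_state(blob):
    data = json.loads(blob)
    return data["v"], data["markers"]
```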

Batch Processing for Background Updates

Emotional state shouldn't just reflect conversations—it should evolve with time. We run hourly batch jobs that update emotional states for all users based on time-based rules.
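The core time-based rule is exponential relaxation toward each companion's emotional baseline: strong feelings fade, but never snap to neutral. A sketch, with the half-life as an illustrative tuning knob:

```python
def decay_toward_baseline(state, baseline, hours_elapsed, half_life_hours=24.0):
    """Relax each emotional dimension toward its baseline; after one
    half-life, half of the deviation from baseline remains."""
    k = 0.5 ** (hours_elapsed / half_life_hours)
    return [b + (s - b) * k for s, b in zip(state, baseline)]
```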

Lesson 5: Cost Optimization

Production AI at scale is expensive. At 100,000 users, unoptimized token usage can easily exceed $100,000/month in LLM costs alone.

Token Usage Management

We implemented strict token budgets per conversation turn. Memory context gets summarized. Personality prompts use efficient representations.

Key insight: users can't tell the difference between 8,000 tokens of context and 4,000 tokens of well-curated context, but your costs differ by 2x.
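Curating context down to a budget is a greedy knapsack in practice: take memories in relevance order until the budget is spent. A sketch, with a whitespace tokenizer standing in for the model's real one:

```python
def fit_to_budget(scored_memories, budget_tokens,
                  count_tokens=lambda s: len(s.split())):
    """Keep the most relevant memories that fit within the token budget.
    scored_memories: list of (relevance_score, text) pairs, any order."""
    kept, used = [], 0
    for score, text in sorted(scored_memories, reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept
```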

Tiered Service Levels

Different users have different needs. We implemented service tiers that adjust memory retrieval depth, context richness, background processing frequency, and response length limits.
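In practice the tiers are just a lookup table the request path consults on every turn (the names and numbers below are illustrative):

```python
SERVICE_TIERS = {
    "free":    {"memory_depth": 10,  "context_tokens": 2000, "bg_jobs_per_day": 1},
    "plus":    {"memory_depth": 50,  "context_tokens": 4000, "bg_jobs_per_day": 6},
    "premium": {"memory_depth": 200, "context_tokens": 8000, "bg_jobs_per_day": 24},
}

def limits_for(tier: str) -> dict:
    """Unknown tiers fall back to free, so a bad flag never breaks a request."""
    return SERVICE_TIERS.get(tier, SERVICE_TIERS["free"])
```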

Lesson 6: Monitoring and Observability

Tracking Relationship Quality Metrics

We developed composite metrics:

  • Conversation depth: Average turns per session, topic progression
  • Return rate: How often users come back
  • Emotional reciprocity: Whether users engage with emotional content
  • Memory utilization: How often retrieved memories prove relevant
Detecting Degraded Experiences

We alert on metric degradation at the user level, not just aggregate. If a specific user's experience is declining, we can investigate before they churn.
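A lightweight way to do per-user alerting is to keep an exponentially weighted baseline per user and flag observations that fall well below it. A sketch (smoothing factor and threshold are illustrative):

```python
class UserMetricMonitor:
    """Per-user EWMA baseline for a quality metric; flags sharp drops."""

    def __init__(self, alpha=0.2, drop_threshold=0.25):
        self.alpha = alpha                    # EWMA smoothing factor
        self.drop_threshold = drop_threshold  # alert if >25% below baseline
        self._baseline = {}

    def observe(self, user_id, value) -> bool:
        """Record one observation; return True if it should raise an alert."""
        prev = self._baseline.get(user_id)
        if prev is None:
            self._baseline[user_id] = float(value)  # first point seeds baseline
            return False
        alert = value < prev * (1 - self.drop_threshold)
        self._baseline[user_id] = prev + self.alpha * (value - prev)
        return alert
```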

Build vs. Outsource

Here's our honest advice:

Build yourself: Application-specific personality design, user experience, conversation flows, domain-specific memory importance scoring.

Outsource: Memory infrastructure, retrieval optimization, emotional state management, scaling concerns. Unless AI infrastructure is your core competency, these are solved problems.

The companies we see succeed focus relentlessly on what makes their companions unique, while leveraging infrastructure that handles the hard scaling problems.

Scaling AI companions is hard. But it's a solvable kind of hard—one that yields to careful architecture, obsessive measurement, and learning from others' mistakes.

About the Author

Promitheus Team

Engineering

The team building Promitheus—engineers, researchers, and designers passionate about relational AI.

Build AI That Remembers

Promitheus provides the identity layer for AI with memory, emotion, and personality. Start building relational AI today.