Two months ago, I blocked off a Friday afternoon to break our persona engine. Not fix it. Break it. I wanted to know what happens when you push past the limits we'd designed for and into territory where things get weird.
The plan: create 10,000 synthetic buyer personas in a single cohort, give them all memories and behavioral profiles, and then run them through a simulated purchasing decision. A full-scale synthetic market.
The result: about 60% beautiful success and 40% fascinating failure. Both halves were useful.
The Setup
Our typical customer creates between 5 and 200 personas for a research project. Focus groups run 5-12 participants. Campaign panels rarely exceed 50. So 10,000 was deliberately absurd, about 50x our largest production workload.
I defined the cohort as B2B SaaS buyers across four segments: small business owners, mid-market IT directors, enterprise procurement leads, and startup CTOs. Each persona got a generated name, company context, purchase history, pain points, and behavioral tendencies. The goal was heterogeneity: 10,000 personas that actually feel different from each other.
Infrastructure: eight GPU-backed generation nodes, a 256GB Redis cluster for persona state, and Postgres for persistent storage. I gave the system 48 hours to generate the full cohort.
What Worked
Generation throughput was fine. Our parallelized generation pipeline handled the volume without drama. We sustained about 40 personas per minute across the cluster, completing the full 10,000 in roughly 4 hours. The queue management worked as designed: batch, dispatch, retry on failure, write to storage.
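If you're curious what "batch, dispatch, retry on failure, write to storage" means concretely, here's a minimal sketch of the per-worker loop (parallelism across nodes omitted). The function arguments and retry parameters are illustrative, not our production code.

```python
import time

MAX_RETRIES = 3  # illustrative, not our real tuning

def run_generation_batch(specs, generate_persona, save_persona):
    """Dispatch a batch of persona specs, retrying transient failures.

    `generate_persona` and `save_persona` are placeholders for the LLM
    call and the Postgres write; returns the specs that still failed
    after all retries so the caller can re-queue them.
    """
    failed = []
    for spec in specs:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                persona = generate_persona(spec)   # LLM inference call
                save_persona(persona)              # persistent storage write
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    failed.append(spec)            # give up, re-queue later
                else:
                    time.sleep(2 ** attempt)       # simple exponential backoff
    return failed
```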
Demographic diversity held up. I ran distribution analysis on the generated cohort and the demographic spread matched our target parameters closely. Age, role seniority, industry, company size, and geographic distribution all landed within 3% of specifications. The stratified sampling approach we built last year is doing its job.
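For a rough sense of what stratified sampling means here: split the cohort across demographic cells in proportion to target shares, then generate enough personas to fill each cell. The sketch below assumes independent dimensions and made-up targets; our real specification is more involved.

```python
from itertools import product

def allocate_cells(total, targets):
    """Split `total` personas across demographic cells in proportion to
    the product of per-dimension target shares (independence between
    dimensions is a simplifying assumption for this sketch)."""
    dims = list(targets.keys())
    cells = {}
    for combo in product(*(targets[d].items() for d in dims)):
        labels = tuple(label for label, _ in combo)
        share = 1.0
        for _, p in combo:
            share *= p
        cells[labels] = round(total * share)  # rounding drift ignored here
    return cells

# Illustrative targets, not our real specification
targets = {
    "segment": {"smb": 0.25, "mid_market": 0.25, "enterprise": 0.25, "startup": 0.25},
    "region": {"na": 0.5, "emea": 0.3, "apac": 0.2},
}
print(allocate_cells(10_000, targets))
```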
Individual persona quality was solid. I spot-checked 200 personas across all four segments. Backstories were coherent, behavioral profiles were internally consistent, and the personas had genuinely different perspectives when I probed them on the same question. Persona #4,721 (a skeptical procurement director in healthcare) gave substantively different answers than Persona #4,722 (an enthusiastic startup CTO in logistics). At the individual level, the system works even at scale.
Where It Broke
Memory pressure was the first wall. Each persona's context window, including their biographical data, memories, and behavioral model, averages about 3,200 tokens. At 10,000 personas, that's 32 million tokens of state that needs to be accessible for any query. Our Redis cluster started swapping to disk at around persona 7,000, and by 9,000 the memory retrieval latency had gone from 12ms to 340ms. Not a crash, but a 28x degradation that made the system practically unusable for interactive queries.
The fix was embarrassingly simple: tiered storage. Hot personas (recently accessed) stay in Redis. Warm personas get compressed and stored in a secondary cache. Cold personas go to disk with an async prefetch when they're about to be needed. We should have built this from the start.
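A minimal sketch of the hot/warm/cold lookup path, with a plain dict standing in for Redis and the warm/cold tiers as compressed blobs. Demotion from warm to cold and the async prefetch are left out, and the names and thresholds are illustrative rather than our actual implementation.

```python
import json
import zlib

class TieredPersonaStore:
    """Hot tier: plain dict standing in for Redis. Warm/cold tiers:
    zlib-compressed blobs (cold would be disk-backed in practice).
    Accessing a persona promotes it back to hot; eviction from hot
    pushes the oldest entry down to warm."""

    def __init__(self, hot_capacity=1000):
        self.hot_capacity = hot_capacity
        self.hot = {}    # persona_id -> dict
        self.warm = {}   # persona_id -> compressed bytes
        self.cold = {}   # persona_id -> compressed bytes (disk stand-in)

    def put(self, pid, persona):
        self.hot[pid] = persona
        self._evict_if_needed()

    def get(self, pid):
        if pid in self.hot:
            return self.hot[pid]
        blob = self.warm.pop(pid, None) or self.cold.pop(pid, None)
        if blob is None:
            raise KeyError(pid)
        persona = json.loads(zlib.decompress(blob))
        self.put(pid, persona)   # promote back to hot on access
        return persona

    def _evict_if_needed(self):
        while len(self.hot) > self.hot_capacity:
            victim = next(iter(self.hot))            # crude FIFO eviction
            persona = self.hot.pop(victim)
            self.warm[victim] = zlib.compress(json.dumps(persona).encode())
```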
Quality degradation in dense segments. Here's the interesting one. When you generate 2,500 startup CTOs, they start sounding the same. Not identical, but the behavioral range narrows. By persona 1,800 in the CTO segment, I was seeing the same metaphors, similar anecdotes, and convergent opinions on pricing. The diversity that looked great at persona 200 was flattening by persona 2,000.
This is a model saturation problem. The underlying LLM has a finite distribution of "startup CTO" archetypes, and once you've sampled enough times, you've covered most of the probability mass. The remaining personas are interpolations between archetypes you've already generated.
Our fix: a diversity enforcement layer that tracks generated traits and deliberately pushes new personas toward underrepresented combinations. If we've already generated fifteen "privacy-focused CTOs who previously worked at Google," the system increases the probability of generating a "growth-obsessed CTO who came from manufacturing." It's not perfect, but it extended the useful diversity range from about 1,800 to roughly 4,000 per segment.
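In spirit, the enforcement layer is just a counter over trait combinations that down-weights whatever we've already generated a lot of. A toy version, with made-up trait axes, might look like this:

```python
import random
from collections import Counter
from itertools import product

class DiversityEnforcer:
    """Tracks how often each trait combination has been generated and
    samples the next persona's traits with probability inversely related
    to that count, so underrepresented combinations catch up."""

    def __init__(self, trait_axes):
        self.trait_axes = trait_axes                      # axis -> list of values
        self.combos = list(product(*trait_axes.values()))
        self.counts = Counter()

    def next_traits(self):
        # Weight each combination by 1 / (1 + times already generated).
        weights = [1.0 / (1 + self.counts[c]) for c in self.combos]
        choice = random.choices(self.combos, weights=weights, k=1)[0]
        self.counts[choice] += 1
        return dict(zip(self.trait_axes.keys(), choice))

# Illustrative trait axes for the CTO segment, not our real taxonomy
enforcer = DiversityEnforcer({
    "attitude": ["privacy-focused", "growth-obsessed", "cost-conscious"],
    "background": ["big tech", "manufacturing", "agency", "academia"],
})
print(enforcer.next_traits())
```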
Campaign execution at scale was brutal. Running all 10,000 personas through a purchasing decision simulation took 14 hours. The bottleneck wasn't generation or storage; it was the decision simulation itself, which is sequential within each persona: the persona has to "think" about the product, weigh factors, and produce a response. You can parallelize across personas, but each individual response still requires a full LLM inference pass. At $0.003 per persona-decision, the full run cost $30. Not terrible for an experiment, but a real customer running iterative campaigns at this scale would burn through budget fast.
We're exploring batched inference and distilled decision models for the simulation step. Early experiments suggest we can get 80% of the decision quality at 10% of the inference cost by using a smaller model for routine decisions and escalating to the full model only for edge cases.
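Here's the routing idea, sketched with stand-in model interfaces and an arbitrary confidence threshold; what actually counts as an "edge case" is still an open question for us.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, not a tuned value

def simulate_decision(persona, product, small_model, full_model):
    """Try the distilled model first; escalate to the full model only
    when the small model reports low confidence in its own decision.
    Both models are assumed to return (decision, confidence)."""
    decision, confidence = small_model(persona, product)
    if confidence >= CONFIDENCE_THRESHOLD:
        return decision, "small"
    # Edge case: fall back to the expensive full-size model.
    decision, _ = full_model(persona, product)
    return decision, "full"
```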
The Surprise Finding
The most useful thing I learned had nothing to do with infrastructure. At around 5,000 personas, aggregate patterns in the simulated market became statistically robust in ways that small panels can't achieve. I could segment the cohort by any combination of demographics and still have enough personas per cell to draw meaningful conclusions.
"What do skeptical mid-market IT directors over 45 in healthcare think about usage-based pricing?" With a 50-persona panel, you might have 2-3 people matching that filter. With 10,000, you have 80+. The research resolution increases dramatically.
This matters because real market research is always making tradeoffs between breadth and depth. You can interview 30 people deeply or survey 3,000 people shallowly. Large-scale synthetic cohorts potentially give you both: deep individual responses from thousands of participants, queryable by arbitrary filters.
What We Changed
Based on this experiment, we made four changes to the production system:
- Tiered memory storage with hot/warm/cold layers, active for all cohorts over 500 personas
- Diversity enforcement that tracks trait distributions and pushes for underrepresented combinations
- Cost estimation in the campaign creation flow so users know what a large run will cost before they commit (a back-of-the-envelope version is sketched after this list)
- Progress streaming for long-running campaigns so you can watch results arrive instead of waiting for the full batch
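The cost estimate itself is simple arithmetic. Something like this, with the per-decision figure from this experiment as an illustrative default rather than our actual rate card:

```python
def estimate_campaign_cost(num_personas, decisions_per_persona,
                           cost_per_decision=0.003, iterations=1):
    """Back-of-the-envelope cost for a simulated campaign. The $0.003
    default matches the per-persona-decision figure from this experiment;
    real pricing depends on the model and prompt size."""
    return num_personas * decisions_per_persona * iterations * cost_per_decision

# The 10,000-persona run described above: one decision each, one pass.
print(f"${estimate_campaign_cost(10_000, 1):,.2f}")   # -> $30.00
```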
We also set a soft limit at 5,000 personas per cohort for now. You can go higher with an explicit override, but the diversity degradation above 5,000 means the marginal value of additional personas drops off. Better to run two diverse cohorts of 5,000 than one homogeneous cohort of 10,000.
Should You Do This?
Probably not at 10,000. For most research questions, 50-500 personas gives you excellent coverage. But if you're doing broad market sizing, testing pricing across many segments simultaneously, or building a persistent synthetic panel that you'll query over months, larger cohorts start making sense.
The system can handle it now. It just took breaking it first to make sure.