Pulp AI
A simulation platform that lets founders and product teams test copy, pricing, and features against ~1,000 AI-generated customer personas, shipping micro-decisions in minutes instead of weeks.
The Problem
Founders make a hundred decisions a week, and research can't keep up
Building from the ground up, I kept running into the same trap. Decisions were either anchored too heavily to my own way of thinking or were shots in the dark, based on a handful of responses from a narrow slice of the market.
Talking to other founders surfaced a pattern. You have to move fast and make a lot of calls, and there is rarely the money or the time to do deep, meaningful market research for every important micro-decision, whether that is the copy, the price point, or the framing of a feature.
The people who felt this most acutely were founding teams and PMs who frequently needed a quick, clear, intuitive read on how their target demographic would react to a product change.
Their alternatives were user interviews and surveys, which are slow to run and rarely wide enough to capture the full diversity of the audience. Pulp AI set out to compress that loop from weeks to minutes without giving up the diversity of perspective that makes research worth doing.
The Wedge
Why a population of personas beats asking one model
Prompting a single LLM with "what would users think?" flattens every perspective into one averaged voice that is directionally vague and easy to over-trust.
Single prompt
1 averaged voice
One blended answer that hides disagreement and gives no quantitative read on who reacts how.
Pulp AI
~1,000 distinct agents
Each persona reasons independently, then reactions are clustered into segments, yielding richer insight with quantitative directionality.
Worked example
Should the Pro tier launch at $29 or $49 per month?
Ask one model
"It depends on your audience. $39 could be a sensible middle ground that balances accessibility against perceived value."
Plausible and instantly forgettable. No read on who wants what, and no number you can act on.
Ask a population
64% lean toward $29, but the quarter of buyers who are enterprise read $49 as a quality signal and disengage at the lower price. Price-sensitive SMBs anchor hard on $29.
Ship $29 as the entry point and keep $49 as a clearly positioned upper tier. A decision, backed by the segments that drove it.
How it works
From a population seed to a segmented verdict
One run takes a product description and a test, synthesizes a matching population, and returns how each market segment will respond.
Define the simulation
The founder describes their product and target demographic, then picks a test type (A/B copy, pricing sensitivity, feature framing) and supplies the variants.
Synthesize the population
The persona generator expands the population seed into ~1,000 heterogeneous agents matching the target distribution. Existing matching profiles are reused via pgvector, and gaps are synthesized and cached.
Agents deliberate
Each persona runs a bounded reason → act → observe loop, calling tools mid-flight and finalizing a validated decision that contains a choice, free-text reasoning, and a confidence score.
Aggregate deterministically
A weight- and confidence-adjusted tally produces the winner, margin, and overall confidence, with no LLM judge, so the headline number is reproducible.
Cluster & narrate
Reactions are clustered into market segments and a single narrative pass writes the insight and recommendation, so the founder sees how each segment will respond.
Architecture
A modular AI layer over a three-tier core
A React frontend app talks to a FastAPI backend that owns the entire run lifecycle, from persona creation through test coordination, agent execution, scoring, and the final summary. Persistent state is split across three stores. PostgreSQL holds relational data, pgvector holds persona embeddings that power dedup and demographic-consistent reuse, and Redis handles sessions and caching. Every model call flows through one provider-agnostic LLM adapter, with concurrency bounded to respect provider rate limits.
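A minimal sketch of what "concurrency bounded to respect provider rate limits" can look like, assuming an asyncio-based adapter. The names `fake_llm_call`, `run_bounded`, and `MAX_IN_FLIGHT` are hypothetical stand-ins, and the cap of 8 is an arbitrary illustration rather than the value Pulp AI used.

```python
import asyncio

MAX_IN_FLIGHT = 8  # assumed cap; tune to the provider's rate limit


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a provider call behind the adapter."""
    await asyncio.sleep(0.01)
    return f"echo:{prompt}"


async def run_bounded(prompts: list[str], limit: int = MAX_IN_FLIGHT) -> tuple[list[str], int]:
    """Run every prompt while never letting more than `limit` calls overlap."""
    gate = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def one(prompt: str) -> str:
        nonlocal in_flight, peak
        async with gate:  # waits here once `limit` calls are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_llm_call(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one(p) for p in prompts))
    return list(results), peak


results, peak = asyncio.run(run_bounded([f"persona-{i}" for i in range(50)]))
print(len(results), "calls finished; peak concurrency:", peak)
```

A semaphore caps simultaneous calls, not requests per minute, so a real adapter would likely pair it with backoff on rate-limit responses.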
Request path
React frontend: test builder, live run view, and the segment-results dashboard.
FastAPI backend: owns the full run lifecycle end to end.
Modular AI layer
Persona generation: expands a population seed into N heterogeneous agents with bios, traits, and behavioral weights.
Agent execution: drives each agent's bounded reason → act → observe loop and validates the submit() decision.
Aggregation: weight- and confidence-adjusted tally, then a single narrative pass. No LLM judge in the loop.
Persistence
PostgreSQL: products, test definitions, runs, and per-agent results.
pgvector: similarity search to dedupe agents and recycle profiles that match new demographics.
Redis: session state and hot-path caching across a run.
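The reuse-or-synthesize decision behind the similarity search can be sketched in memory. In the real system the nearest-neighbour lookup would run inside the database through pgvector; this pure-Python version only illustrates the logic. The `Persona` shape, the three-dimensional embeddings, and the 0.9 threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional

REUSE_THRESHOLD = 0.9  # assumed cosine-similarity cutoff for "close enough to reuse"


@dataclass
class Persona:
    name: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reuse_or_none(target: list[float], pool: list[Persona],
                  threshold: float = REUSE_THRESHOLD) -> Optional[Persona]:
    """Return the closest stored persona if it clears the threshold, else None."""
    best = max(pool, key=lambda p: cosine(target, p.embedding), default=None)
    if best is not None and cosine(target, best.embedding) >= threshold:
        return best
    return None  # caller would synthesize a new persona and cache it


pool = [
    Persona("smb_owner", [1.0, 0.0, 0.2]),
    Persona("enterprise_buyer", [0.0, 1.0, 0.1]),
]
hit = reuse_or_none([0.9, 0.1, 0.2], pool)    # close to smb_owner, so it is reused
miss = reuse_or_none([0.5, 0.5, -0.9], pool)  # nothing similar enough, so a gap to fill
```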
Engineering
Key technical decisions
Extracting one JSON answer from a prompt is trivial. Orchestrating a thousand asynchronous, multi-turn agents into a validated, reproducible verdict, without one bad agent corrupting the whole, was the real work.
Persona generation
Synthetic populations from a single seed
Instead of relying on static research data or hand-authored profiles, Pulp AI uses a generation pipeline that expands a population seed (the simulation's ideal customer profile, or ICP) into N heterogeneous agents. Each agent gets a bio, traits, and behavioral weights drawn from free-text hints or predefined cohorts.
Consistency
Identity anchoring & deterministic replay
A frozen system prompt re-injects each agent's core demographic profile into every model call to prevent persona drift. In the offline harness, outputs are seeded by a hash of the system prompt, so any given agent's output can be replayed while the population stays diverse.
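Seeding by a hash of the system prompt might look roughly like this. It is a sketch under assumptions: `seed_from_prompt` and `offline_choice` are hypothetical names, SHA-256 is my choice of hash, and the random pick stands in for whatever the offline harness actually produces.

```python
import hashlib
import random


def seed_from_prompt(system_prompt: str) -> int:
    """Derive a stable integer seed from the frozen system prompt."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def offline_choice(system_prompt: str, variants: list[str]) -> str:
    """Stand-in for an offline agent output: same prompt, same pick, every run."""
    rng = random.Random(seed_from_prompt(system_prompt))
    return rng.choice(variants)


prompt_a = "You are Dana, 34, operations lead at a 12-person agency."
prompt_b = "You are Raj, 51, procurement director at a 4,000-person firm."
variants = ["$29", "$49"]

first = offline_choice(prompt_a, variants)
replay = offline_choice(prompt_a, variants)  # identical to `first`
```

Python's built-in `hash()` is salted per process for strings, which is why the sketch uses `hashlib`. Different personas get different seeds, so the population stays varied even though each agent is repeatable.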
Agent design
Bounded reason → act → observe loops
Every persona is a true multi-step agent, not a single call. It can invoke tools, fold observations back into context, and iterate up to a step budget before finalizing through a synthetic submit tool whose schema is the persona's decision model.
Reliability
Provider-agnostic, schema-validated output
The submit tool's JSON schema is derived from a Pydantic decision model and inlined so enums survive across Anthropic and OpenAI. Validation failures trigger budgeted retries rather than letting malformed data persist. Nothing above the adapter layer depends on a specific vendor.
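To illustrate the inlining step: Pydantic v2 emits enum fields as `$ref` pointers into a `$defs` block, and one way to flatten that is a small recursive pass like the one below. This is a sketch, not the project's code. The `inline_refs` helper and the hand-written schema (including the `"A"`/`"B"` enum values) are assumptions, and the helper assumes no recursive definitions.

```python
import copy


def inline_refs(schema: dict) -> dict:
    """Return a copy of `schema` with local '#/$defs/...' references expanded."""
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                target = copy.deepcopy(defs[ref.split("/")[-1]])
                # Keep sibling keys such as "description" next to the expansion.
                siblings = {k: v for k, v in node.items() if k != "$ref"}
                return resolve({**target, **siblings})
            # Drop the "$defs" block itself; everything it held is now inline.
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema)


# Hand-written to mimic the shape a decision model's schema could take.
schema = {
    "$defs": {"Choice": {"enum": ["A", "B"], "title": "Choice", "type": "string"}},
    "properties": {
        "choice": {"$ref": "#/$defs/Choice"},
        "reasoning": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["choice", "reasoning", "confidence"],
    "type": "object",
}

flat = inline_refs(schema)
print(flat["properties"]["choice"])
```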
Fault tolerance
Three-tier isolation across the population
Tool errors become observations, step-limit breaches fall back to low-confidence results, and catastrophic agent crashes are encapsulated in a result container. One malfunctioning agent never corrupts the aggregate of a thousand.
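The outermost tier, where a crash is captured in a result container instead of propagating, can be sketched as follows. `AgentResult`, `run_agent`, and `run_isolated` are hypothetical names, and agent 3 is rigged to fail purely for demonstration.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentResult:
    agent_id: int
    choice: Optional[str] = None
    confidence: float = 0.0
    error: Optional[str] = None  # set only when the agent crashed


async def run_agent(agent_id: int) -> AgentResult:
    """Hypothetical agent body; id 3 simulates a catastrophic crash."""
    await asyncio.sleep(0)
    if agent_id == 3:
        raise RuntimeError("agent blew up")
    return AgentResult(agent_id, choice="A", confidence=0.8)


async def run_isolated(agent_id: int) -> AgentResult:
    """Convert any crash into a result the aggregator can skip."""
    try:
        return await run_agent(agent_id)
    except Exception as exc:
        return AgentResult(agent_id, error=f"{type(exc).__name__}: {exc}")


async def run_population(n: int) -> list[AgentResult]:
    return list(await asyncio.gather(*(run_isolated(i) for i in range(n))))


results = asyncio.run(run_population(5))
usable = [r for r in results if r.error is None]
print(f"{len(usable)} of {len(results)} agents produced a usable decision")
```

Because every coroutine returns a container rather than raising, `asyncio.gather` never aborts the batch, and the aggregation step can simply filter on `error`.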
Aggregation
Deterministic scoring, then one narrative pass
No LLM judge. A weight- and confidence-adjusted tally produces the winner, margin, and confidence, so the headline number is reproducible. A single narrative pass then writes the insight and recommendation on top.
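One plausible reading of a weight- and confidence-adjusted tally is that each agent contributes `weight * confidence` to its chosen variant. The exact formula is not given here, so treat the scoring rule, the `Decision` shape, and the sample numbers below as assumptions. The sketch returns the winner and margin and leaves out the overall confidence figure, since its definition isn't specified.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    choice: str
    weight: float      # behavioral weight of the persona
    confidence: float  # self-reported, in [0, 1]


def tally(decisions: list[Decision]) -> dict:
    """Score each variant by summed weight * confidence (assumed rule)."""
    scores: dict[str, float] = defaultdict(float)
    for d in decisions:
        scores[d.choice] += d.weight * d.confidence
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no usable decisions to tally")
    # Sort by score, then by name, so ties break the same way on every run.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    shares = {choice: score / total for choice, score in ranked}
    winner = ranked[0][0]
    runner_up = ranked[1][1] / total if len(ranked) > 1 else 0.0
    return {"winner": winner, "shares": shares, "margin": shares[winner] - runner_up}


decisions = [
    Decision("$29", weight=1.0, confidence=0.9),
    Decision("$29", weight=1.0, confidence=0.6),
    Decision("$49", weight=2.0, confidence=0.5),
    Decision("$29", weight=0.5, confidence=1.0),
]
verdict = tally(decisions)
print(verdict["winner"], round(verdict["margin"], 3))
```

No randomness and a fixed tie-break mean the same set of decisions always yields the same headline number, which is the property the narrative pass builds on.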
The per-agent loop
Deliberate
Reason over the variant
Act
Call a tool (calc, research)
Observe
Feed result back as context
Finalize
submit() → validated decision
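The four stages above can be sketched as a single bounded loop. Everything named here is a stand-in: `scripted_policy` plays the role of the model, `calc` is a toy tool, `validate` is a hand-rolled substitute for schema validation, and the step budget of 4 is arbitrary.

```python
from typing import Callable

MAX_STEPS = 4  # assumed step budget


def calc(expression: str) -> str:
    """Toy calculator tool: handles only 'a / b', so it can fail realistically."""
    left, right = expression.split("/")
    return str(float(left) / float(right))


TOOLS: dict[str, Callable[[str], str]] = {"calc": calc}


def validate(args: dict) -> dict:
    """Minimal stand-in for schema validation of the submit() payload."""
    if args.get("choice") not in {"A", "B"}:
        raise ValueError("choice must be 'A' or 'B'")
    confidence = float(args.get("confidence", -1))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return {"choice": args["choice"],
            "reasoning": str(args.get("reasoning", "")),
            "confidence": confidence}


def run_agent(policy: Callable[[list[str]], dict], max_steps: int = MAX_STEPS) -> dict:
    context: list[str] = []
    for _ in range(max_steps):
        action = policy(context)                 # deliberate
        if action["tool"] == "submit":           # finalize
            try:
                return validate(action["args"])
            except (ValueError, TypeError) as exc:
                context.append(f"submit rejected: {exc}")
                continue                         # a retry costs one step
        try:
            observation = TOOLS[action["tool"]](action["args"])  # act
        except Exception as exc:
            observation = f"tool error: {exc}"   # errors become observations
        context.append(observation)              # observe
    # Step budget exhausted: fall back to a low-confidence result.
    return {"choice": None, "reasoning": "step budget exhausted", "confidence": 0.0}


def scripted_policy(context: list[str]) -> dict:
    """Hypothetical stand-in for the model: one tool call, then submit."""
    if not context:
        return {"tool": "calc", "args": "49 / 29"}
    return {"tool": "submit",
            "args": {"choice": "A",
                     "reasoning": f"B costs {context[-1][:4]}x more",
                     "confidence": 0.7}}


def stuck_policy(context: list[str]) -> dict:
    """Never submits, and its tool call always fails."""
    return {"tool": "calc", "args": "1 / 0"}


decision = run_agent(scripted_policy)
fallback = run_agent(stuck_policy)
```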
Demo
Run a simulation
Pick a test, press run, and watch a synthetic population react, then see the aggregated verdict and segment breakdown. This is a stylized mockup with pre-computed results. The product ran this live against a freshly synthesized population each time.
Validation
Back-tested against real outcomes
Across 10 early-stage pilots, the most consistent signal was speed. Micro-decisions that used to take weeks of interviews resolved in minutes. To sanity-check the predictions, we replayed historical A/B tests through the simulator.
Winning-variant accuracy · back-test over 10 historical A/B tests
20 percentage points above chance at calling the winner, plus directional lift, on a small historical sample.
On a small set of 10 historical A/B tests, the simulator picked the winning variant in 7 of 10 cases (70%), which is 20 percentage points above the 50% coin-flip baseline, and recovered the directional lift. Given the sample size, that is a promising early signal rather than a validated benchmark.
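A quick way to see why the sample size matters: assuming the 10 tests are independent and each has a 50% chance of being called correctly by luck, the probability of getting 7 or more right can be computed directly. The helper name `p_at_least` is mine.

```python
from math import comb


def p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_value = p_at_least(7, 10)
print(p_value)  # 0.171875
```

A coin flip would reach 7 of 10 about 17% of the time, so the result is encouraging but far from conclusive until it is repeated on a larger set of tests.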
Learnings & what's next
Fun to build, hard to make cheap
Engineering a system like this is genuinely a thrill. Orchestrating a thousand reasoning agents into one coherent answer is a satisfying problem. The hard part isn't getting it to work. It's getting it to work within a tight cost budget without sacrificing the efficiency and fidelity that make the output worth trusting.
That tension between cost and quality is the core lesson I'm carrying forward. The plan now is to open-source the core concept as a library, so anyone can spin up persona simulations for their own decisions and tune the cost/fidelity trade-off for their use case.
The core engine is going open-source
A library version of Pulp AI's persona-simulation core is on the way. Want a heads-up when it lands, or to swap notes on agent orchestration? Reach out.