Validated against reality: Tested the engine's predictions against 3 published real-world A/B tests before building the full product. Called all 3 correctly. That's the only number that matters.
The Problem
Email marketers have always tested subject lines the same way: split the list, send both versions, wait a few hours, declare a winner. It works. But it comes at a cost:
- You've already sent to 20–50% of your list before you know which version wins
- Small lists can't support a meaningful split without losing statistical significance
- The feedback loop takes days, not minutes
- Every losing variant burns real engagement from real subscribers
The deeper issue is that marketers never fully develop intuition. The feedback is so slow and noisy that it's almost impossible to connect what you wrote to why it worked. You're flying on gut feel with occasional data confirmation, and the data arrives too late to be useful.
The Idea
What if you could simulate your audience before hitting send?
Instead of testing on real subscribers, the engine generates 50 AI personas calibrated to your audience type (newsletter readers, DTC shoppers, SaaS users) and runs each one through a multi-round decision process that mirrors how people actually behave in their inbox.
The output isn't a score or a keyword grade. It's a predicted open rate with a confidence interval, plus the actual reasoning each persona used. You see which personas opened and why, which ones skipped and what put them off, and what a rewritten subject line might do differently.
You get all of this in under 60 seconds, before a single email is sent.
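To make that concrete, here is a rough sketch of the shape of the result. The field names are illustrative, not the engine's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PersonaReaction:
    persona_id: int
    decision: str        # "open", "skip", or "maybe"
    reasoning: str       # what drew this persona in, or what put them off

@dataclass
class SimulationResult:
    predicted_open_rate: float                 # e.g. 0.23
    confidence_interval: tuple[float, float]   # e.g. (0.18, 0.29)
    reactions: list[PersonaReaction]           # one per persona, with reasoning
    rewrite_notes: str | None = None           # what a rewritten subject line might change
```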
How It Works
The simulation runs in five stages, each one building on the last.
Stage 1: Persona Generation
50 audience personas are generated in parallel, each with distinct traits: inbox habits, content preferences, skepticism levels, and decision patterns. Before generation, the engine queries its calibration database (real open rates from previous campaigns with the same audience type) and anchors the personas to realistic baseline behaviour rather than LLM guesses.
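As a rough sketch of how this stage could be wired, assuming an async LLM client with a `generate_json` method and a `db` handle exposing the calibration query (both are illustrative names, not the engine's real interfaces):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Persona:
    inbox_habits: str
    content_preferences: str
    skepticism: float          # 0.0 (credulous) to 1.0 (highly skeptical)
    baseline_open_rate: float  # anchored to real campaign data, not an LLM guess

async def generate_persona(llm, audience_type: str, baseline: float) -> Persona:
    # The calibration baseline goes into the prompt so the persona's behaviour
    # stays consistent with observed open rates for this audience type.
    traits = await llm.generate_json(
        f"Create an inbox persona for a {audience_type} audience. "
        f"Typical open rate for this audience is {baseline:.0%}. "
        "Return JSON with inbox_habits, content_preferences, skepticism."
    )
    return Persona(**traits, baseline_open_rate=baseline)

async def generate_personas(llm, db, audience_type: str, n: int = 50) -> list[Persona]:
    baseline = await db.fetch_baseline_open_rate(audience_type)  # calibration lookup
    return list(await asyncio.gather(
        *(generate_persona(llm, audience_type, baseline) for _ in range(n))
    ))
```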
Stage 2: First Glance Decision
Each persona sees the subject line for the first time and makes an initial call: open, skip, or maybe. "Maybe" personas (roughly 15–20%) are in the middle of something else when the email lands. They get pulled back in four simulated hours later and make a final decision with the email still sitting unread in their inbox. This mirrors how real inbox decisions actually happen.
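A minimal sketch of how the first-glance pass and the delayed "maybe" resolution might look; the prompts and helper names are illustrative.

```python
from enum import Enum

class FirstGlance(str, Enum):
    OPEN = "open"
    SKIP = "skip"
    MAYBE = "maybe"

async def first_glance(llm, persona, subject_line: str) -> FirstGlance:
    decision = await llm.generate_json(
        f"You are a subscriber with these inbox habits: {persona.inbox_habits}. "
        f"A new email arrives with the subject line: '{subject_line}'. "
        "Decide now: open, skip, or maybe (you are in the middle of something else)."
    )
    return FirstGlance(decision["choice"])

async def resolve_maybe(llm, persona, subject_line: str) -> FirstGlance:
    # Four simulated hours later: the email is still unread, buried under newer
    # messages, and the persona has to commit to open or skip.
    decision = await llm.generate_json(
        f"It is four hours later. The email '{subject_line}' is still unread in "
        "your inbox below newer messages. Make a final decision: open or skip."
    )
    return FirstGlance(decision["choice"])
```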
Stage 3: Deep Engagement
Personas who opened get a second round. They see preview text and a snippet of the email content, then decide: click through, skim and close, or unsubscribe. This stage captures whether the subject line over-promised or actually matched what was inside.
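One way the over-promise signal could fall out of this stage: a subject line that wins opens but loses engagement shows up as a high skim-and-close or unsubscribe share. The threshold below is an assumption for illustration, not the engine's actual rule.

```python
from collections import Counter

def over_promise_signal(engagements: list[str]) -> bool:
    # engagements holds one of "click_through", "skim_and_close", "unsubscribe"
    # per persona that opened in Stage 2.
    counts = Counter(engagements)
    total = sum(counts.values()) or 1
    bounce_rate = (counts["skim_and_close"] + counts["unsubscribe"]) / total
    return bounce_rate > 0.7   # assumed threshold: lots of opens, little follow-through
```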
Stage 4: Report & Verification
A report agent synthesises all persona reactions into a predicted open rate and a statistical confidence interval. If the prediction deviates significantly from the calibration baseline (say, a predicted 50% for a cold outreach list where 10–15% is strong), the agent runs a second pass with the historical data injected as an anchor, forcing it to justify any large deviation from real-world norms.
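For illustration, the aggregation could look something like this, using a Wilson score interval for the uncertainty range; the 2× deviation rule is an assumed trigger, not the engine's actual one.

```python
import math

def wilson_interval(opens: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for the predicted open rate.
    p = opens / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - margin, centre + margin

def needs_second_pass(predicted: float, baseline: float) -> bool:
    # e.g. a predicted 50% open rate against a 10-15% cold-outreach baseline
    # would trip this check and force a second, anchored pass.
    return predicted > 2 * baseline or predicted < baseline / 2
```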
Stage 5: Quality Scoring (background)
After the result is returned to you, a background job evaluates the simulation quality on four dimensions: realism, persona diversity, reasoning coherence, and calibration accuracy. These scores accumulate over time and surface systematic biases before they affect predictions.
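Since ARQ is the job queue in the stack, the background scoring job might be wired up roughly like this; the scoring calls and context helpers are illustrative, not the real implementation.

```python
from arq.connections import RedisSettings

DIMENSIONS = ("realism", "persona_diversity", "reasoning_coherence", "calibration_accuracy")

async def score_simulation(ctx, simulation_id: str) -> dict[str, float]:
    # ctx is ARQ's job context; db and judge would be attached in on_startup.
    sim = await ctx["db"].load_simulation(simulation_id)          # assumed helper
    scores = {dim: await ctx["judge"].score(sim, dim) for dim in DIMENSIONS}
    await ctx["db"].save_quality_scores(simulation_id, scores)    # accumulates over time
    return scores

class WorkerSettings:
    functions = [score_simulation]
    redis_settings = RedisSettings(host="localhost")
```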
What Makes It Different
| Approach | What you get |
|---|---|
| Keyword scoring tools | A grade based on word lists and sentiment rules, no audience model |
| Traditional A/B testing | Real data, but you've already sent to half your list |
| This engine | A predicted open rate from a simulated audience, with reasoning and confidence range, before you send anything |
The key difference is that this doesn't score the subject line in isolation. It simulates how a specific audience type actually behaves when they see it, in the context of a real inbox with competing emails and limited attention.
The Validation
Before writing a single line of product code, I tested the core prediction against three published real-world A/B tests, cases where the winning subject line was already publicly known.
The simulation ran blind: given only the two subject lines and the audience type, it had to pick the winner.
It called all three correctly.
This isn't proof of generalisability. Three tests is not a statistically significant sample. But it's proof that the underlying mechanism works: simulating audience psychology produces predictions that track reality, at least in these cases. The calibration corpus is what turns that anecdotal evidence into statistical evidence as the product runs and real campaign outcomes are submitted.
Tech Stack
Backend: Python, FastAPI, ARQ (background job queue), Redis, PostgreSQL, asyncpg
AI / LLM: Claude (Anthropic) and OpenAI GPT with a provider abstraction layer; Pinecone for semantic memory of past simulations; LangChain for orchestration
Frontend: Next.js, TypeScript, Tailwind CSS, Server-Sent Events for live persona reveal during simulation (see the sketch after this list)
Infrastructure: Docker for local development, AWS for production, Supabase for managed Postgres
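As an illustration of the live reveal, a Server-Sent Events endpoint in FastAPI could stream persona reactions as they finish; the endpoint path, queue wiring, and payload shape below are assumptions, not the actual API.

```python
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/simulations/{sim_id}/stream")
async def stream_personas(sim_id: str):
    async def event_source():
        queue: asyncio.Queue = get_result_queue(sim_id)  # assumed: filled as personas finish
        while True:
            reaction = await queue.get()
            if reaction is None:                          # sentinel: simulation complete
                yield "event: done\ndata: {}\n\n"
                break
            yield f"data: {json.dumps(reaction)}\n\n"     # one SSE event per persona
    return StreamingResponse(event_source(), media_type="text/event-stream")
```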
Status
Live and running. The calibration corpus grows with each campaign that submits actual results after sending. The longer it runs, the more accurate the baseline anchoring gets, and the harder the product becomes to replicate, because the data compounds.