Column

A/B Testing That Generates Its Own Next Round — Field Experiments with AI Agents

An AI agent learned from 700K+ patient visit field experiment data and auto-generated 17 new message variants for round two. The top AI-generated message hit 69.8% CTR, outperforming expert-designed messages. The key finding: LLMs without real data couldn't predict effective interventions. Data is what makes the loop work.

2026 / 06 / 03 5 min read 日本語版 →

Abstract visualization of field experiment data being learned by an AI agent that autonomously generates the next round of intervention messages

Hi, I’m Keito Inoshita from Affectosphere Group.

Most A/B testing ends the same way. The data comes in. Someone analyzes it. Someone writes up what worked and what didn’t. Someone proposes the next variant. The cycle restarts with a human in the loop at every step.

What would change if the “propose next variant” step ran autonomously?

A paper published on arXiv in June 2026 (Junjie Luo, Ritu Agarwal, Gordon Gao et al., arXiv:2606.02458) ran exactly this experiment in a healthcare setting. An AI agent autonomously learned from a field experiment involving approximately 700,000 patient visits, analyzed what distinguished effective from ineffective messages, and generated 17 new variants for round two. The top AI-generated message achieved a 69.8% click-through rate — outperforming the best message from the expert-designed first round.

And the most important finding from the study isn’t the 69.8%.

3 Points for Today

Value: An AI agent that learns from field experiment data can close the loop — generating the next round of interventions autonomously, without waiting for a human to analyze and re-hypothesize.
The critical insight: LLMs without real experiment data couldn’t predict effective interventions. The data is what makes the learning work.
Application: How marketing, CRM, and patient engagement teams can design a continuous improvement loop using this framework.

① What the Two-Stage Design Did

The research context is a healthcare patient engagement program — specifically, getting patients to follow through on clinic visits and recommended screenings by improving reminder message design.

Stage one: medical experts and AI collaborated to design multiple message variants, which were deployed as a real field experiment across the patient population. This is conventional A/B testing territory.

Stage two: an AI agent autonomously processed and analyzed the results of stage one — not just the winner, but the entire pattern of what worked and what didn’t across variants — and used that learning to generate 17 new candidate messages for the next experiment round.

The AI wasn’t given access to a literature review or a marketing playbook. It learned from the experimental outcomes and generalized from them.

② The Result and the More Important Finding

The top AI-generated message from stage two reached 69.8% CTR, exceeding the best stage-one message.

But the finding the researchers emphasize is different, and more practically useful.

The team also tested what happens when you ask the same LLM to predict effective interventions without the field experiment data — the zero-shot condition. The result: the LLM without real data couldn’t generate predictions that corresponded to effective interventions.

This is the critical practical insight. “AI can replace A/B testing” is not what this paper shows. “AI that has learned from real experimental data can accelerate the next round of experimentation” is what it shows.

The lesson: data collection is not optional. The loop only works because stage one generated real outcome data. An LLM operating on priors alone won’t produce the same effect.

③ How Marketing and CRM Teams Can Use This

Translating the research into operational design.

Email marketing: autonomous next-round generation

Email subject line and body A/B test results are processed by an AI agent after each campaign round. The agent learns what distinguished high-engagement variants from low-engagement ones, and proposes the next batch of subject lines to test.

The workflow change: the marketing team’s role in the loop shifts from “analyze results, re-hypothesize, draft new variants” to “review AI-generated proposals, approve, and deploy.” The bottleneck between rounds shrinks.

Department and role mapping: owned by email marketing or CRM teams. Measurable outcome: number of improvement rounds completed per quarter × conversion rate trajectory.

Push notifications and appointment reminders

The healthcare context in this paper transfers most directly to: hospital and clinic digital outreach teams, health app engagement managers, and preventive care program operators. Appointment reminder optimization, vaccination nudges, and screening follow-ups are contexts where the target behavior is measurable and the message is the lever.

These are ideal starting conditions for this framework — clear conversion metric, defined audience, repeated intervention opportunity.

What to watch out for: data scale and conversion depth

The paper used approximately 700,000 patient visits — a large dataset. Smaller datasets will yield less reliable learning, and the appropriate scale for deployment in your context requires its own validation.

Also: what the AI learns from depends on what you’re measuring. If the conversion metric is click-through rate rather than downstream behavior (actual visit, actual purchase), the loop may optimize for superficial engagement. Defining the right outcome variable before building the loop matters more than the AI architecture.

The Paradigm: AI as an Experimental Learner, Not a Predictor

The design philosophy here is distinct from “use AI to predict the optimal intervention.” That framing asks the AI to generate good answers from priors. It doesn’t work, as this paper’s zero-shot condition shows.

The working paradigm is: use AI to learn from real experimental outcomes and accelerate the next round of hypothesis generation. The AI’s value is in the loop, not in the oracle.

“AI that makes A/B testing unnecessary” misframes the contribution. “AI that makes A/B testing iterate faster and generate better hypotheses” is closer to what this paper demonstrates.

That’s the use case worth building toward.

That’s all for today!

References

Junjie Luo, Ritu Agarwal, Gordon Gao (2026). Beyond One-shot: AI Agents for Learning in Field Experiments. arXiv preprint.

* This article was written in part with AI assistance and may contain inaccuracies.