Auto-tuning is not experimentation. Experimentation is the discipline AI-CRO vendors are selling around.

by , Founder & Growth Lead

Take a DTC brand on Shopify doing 100,000 monthly sessions at the median ecommerce conversion rate of 2.81% and the median order value of about $58 (both numbers from the Littledata 2026 ecommerce benchmark). The brand turns on an AI-CRO vendor across the homepage hero, the product page, the cart, and the checkout. Three months later the dashboard shows a 6% lift on the hero, 2% on the PDP, no statistically significant movement on cart or checkout. The vendor's QBR slide says the lifts compound to roughly a 4% lift in overall revenue. Everyone leaves the meeting satisfied.

Meanwhile a competitor at half the traffic runs eight customer interviews. They discover that one product detail page reads — to a non-trivial cohort — as a subscription rather than a one-time purchase, because the price-per-month framing is doing too much work. They change the copy. Their PDP-to-cart rate moves materially in a single week. They did not run an A/B test for the change; the qualitative signal was that strong. They ran a holdout afterwards to make sure they had not broken the cohort that actually preferred the subscription framing.

This is the gap the rest of this piece is about. The first brand bought a tool that runs faster guesses. The second brand ran a research discipline that produced one good guess. Both "did CRO." For the second brand, the thing that compounded was the layer the vendor's pitch never mentions — because if the operator runs that layer well, the vendor's tool is the commodity underneath.

The smuggled assumption

The auto-tuning pitch quietly assumes that the bottleneck on conversion is how many variant combinations the team can get through, not the quality of the variants themselves. At LeanBoat-buyer scale (sub-$50M DTC, single-digit-million ARR SaaS) that assumption is almost always wrong.

The two camps, named

The CRO field has been bifurcated long enough that both camps publish their position as the product.

The research-first camp publishes its methodology as the artifact. Peep Laja at Speero packages it as the Experimentation Operating System and says out loud that test design and governance are what compound, not the tool that runs the test. Karl Gilis at AGConsult runs a research-first methodology that combines emotion-targeting, qualitative work, and rigorous test design. Talia Wolf at GetUplift wraps the same shape in emotion-first framing. André Morys at konversionsKRAFT has been publishing the same architecture from Germany for two decades. Lukas Vermeer, formerly Director of Experimentation at Booking.com, has multiple peer-reviewed papers and a public framing he calls "The Experimentation Gap": the gap between what a team thinks they learned from a test and what the test can actually support. Lucia van den Brink made the operator version of the argument most directly on the VWO podcast: companies running tests at meaningful volume must own experimentation in-house, not rent it from agencies or auto-tuners.

The auto-tuning camp publishes the tool as the artifact. Mutiny pitches AI personalization for B2B account pages. Intellimize (now Webflow Optimize) sells a "no-A/B-test" personalization story. VWO ships multi-armed-bandit and AI-CRO modules as testing successors. Optimizely's Personalization product positions auto-targeting as the natural evolution past controlled tests. Evolv AI ships multi-armed-bandit-as-platform, the cleanest version of the pitch: the math will figure out the best variant if you let it.

None of these tools are mechanically bad at what they do. The question is whether what they do mechanically is the layer that compounds at the buyer's traffic and revenue scale. That is a unit-economics question, not a marketing question.

Unit economics, where each camp is the right answer

A conversion-rate point is worth a different number of dollars at different scales. The auto-tuning pitch is implicitly a high-traffic enterprise pitch. The research-first methodology's leverage is implicitly a sub-enterprise pitch. The math is what makes assigning each camp to a tier non-arbitrary.

Worked at the LeanBoat buyer band, with the Littledata benchmark as inputs, the numbers look like this. A brand at 100,000 monthly sessions, 2.81% baseline CVR, $58 AOV: a 0.5-percentage-point CVR lift (2.81% to 3.31%) is 500 additional monthly orders, about $29,000 in additional monthly revenue, around $348,000 annualized. A single half-point lift produces a mid-six-figure annual delta. At the upper edge of the band, a million monthly sessions at the same baseline, the same half-point lift is a low-seven-figure annual delta (about $3.5M). At the lower edge, 25,000 monthly sessions, it is a low-five-figure quarterly delta (about $22,000).
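The arithmetic above can be reproduced in a few lines. This is a minimal sketch, not a forecasting model: the function name `revenue_delta` and its parameters are illustrative, and it assumes traffic and AOV stay flat while only the conversion rate moves.

```python
def revenue_delta(monthly_sessions: int, cvr_lift_points: float, aov: float) -> dict:
    """Revenue impact of an absolute conversion-rate lift, given in percentage points.

    Illustrative assumption: sessions and average order value do not change.
    """
    extra_orders = monthly_sessions * cvr_lift_points / 100
    monthly = extra_orders * aov
    return {
        "extra_orders_per_month": extra_orders,
        "monthly": monthly,
        "quarterly": monthly * 3,
        "annual": monthly * 12,
    }


# Lower edge, middle, and upper edge of the band, at the $58 AOV used in the text.
for sessions in (25_000, 100_000, 1_000_000):
    d = revenue_delta(sessions, cvr_lift_points=0.5, aov=58)
    print(f"{sessions:>9,} sessions: {d['extra_orders_per_month']:,.0f} orders/mo, "
          f"${d['monthly']:,.0f}/mo, ${d['annual']:,.0f}/yr")
```

At 100,000 sessions this reproduces the 500 orders, $29,000 a month, and $348,000 a year quoted above. Swapping in a brand's own sessions and AOV shows quickly which tier it sits in.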

This is the band the auto-tuning vendor sells into and the band the research-first methodology compounds into. It is also the band where the cost of a wrong test — running a test that cannot answer the question, or running a test on a question that was never the bottleneck — eats the entire delta. Faster bad tests is the failure mode the auto-tuning install produces here.

The math flips at very high traffic. At ten or twenty million monthly sessions, concurrent variant counts go up, the cost of running long enough for statistical power goes up faster, and a bandit that re-allocates traffic in flight is materially better than a binary winner-loser test. This is where the research-first operators stop being competitive on the execution layer and start being competitive only on the layer above it — the prior-setting layer. Auto-tuning has a tier where the install is right. That tier is one or two orders of magnitude above the buyer this post is for.
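For readers who have not seen what "re-allocating traffic in flight" means mechanically, here is a small simulation. It uses Thompson sampling, one common bandit algorithm chosen for illustration; the text does not say which algorithm any vendor runs, and the conversion rates below are invented.

```python
import random


def thompson_allocate(true_rates, rounds, seed=0):
    """Simulate a Beta-Bernoulli Thompson-sampling bandit.

    true_rates are hypothetical per-variant conversion rates, hidden from the bandit.
    Returns how many sessions each variant ended up receiving.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # converting sessions seen per variant
    losses = [0] * len(true_rates)  # non-converting sessions seen per variant
    for _ in range(rounds):
        # Sample a plausible rate for each variant from its Beta posterior (uniform prior).
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))  # route this session to the best-looking draw
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


traffic = thompson_allocate([0.02, 0.03, 0.10], rounds=5_000)
print(traffic)  # sessions per variant; most should land on the third
```

The bandit shifts sessions toward whichever variant is converting, without waiting for a fixed-horizon test to finish. It only ever chooses among the variants it was handed, which is why the prior-setting layer still matters at this tier.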

What auto-tuning skips: the design-of-experiments layer

Three pieces of recent academic work are the cleanest published versions of what auto-tuning actually does — and what it does not do. They are useful precisely because they are pro-AI: they describe the technique on its own terms, not the vendor pitch.

AgentA/B (arXiv 2504.09723) demonstrates that LLM agents can simulate users in A/B tests, with a 1,000-agent case study on Amazon.com. The paper is the strongest empirical defense of agent-based testing — and it explicitly positions agents as scaling the existing research questions, not substituting for the human research that defines them. Experimentation Accelerator (arXiv 2602.13852) shows LLMs translating ranked opportunities into concrete creative suggestions and estimating learning and conversion potential. Direct support for AI as a research-amplifier, not a research-replacement. RL-LLM-AB (arXiv 2506.06316) is reinforcement-learning-plus-LLM for automated A/B testing in personalized marketing — the cleanest published mechanical description of what the vendors are selling.

The throughline across all three: AI is genuinely useful at the ranking, simulation, and execution layers. None of the three papers shows AI generating the prior — the customer-research-grounded hypothesis the test is meant to evaluate. That layer is still human work. The operator-tier discipline that runs that layer well is what compounds across tests. Auto-tuning skips this layer because staffing it is expensive and it does not fit on a vendor's pricing page.

Where auto-tuning actually wins (the steel-man)

Auto-tuning vendors do legitimate work at very high traffic volumes where statistical power for manual tests is hard to maintain across many concurrent variants. The customer reference accounts at this tier are real. The math supports the install. The wrong-tier critique is more credible when the right tier exists.

The customer base that is not at this tier — and is being pitched the same product anyway — is the readership of this post.

What to install instead

The research-first discipline is staffable inside the LeanBoat-buyer band. It does not require a 26-year-old agency or an experimentation PhD. It requires four things, in order.

One — a research cadence the team will actually run. At the buyer band that means concrete methods, not a vendor login. Five to eight customer interviews a quarter against the same revenue-driver hypothesis. A weekly read of session recordings filtered to high-intent traffic that did not convert. A monthly scrape of support tickets, returns reasons, and survey free-text against the same hypothesis. Heatmaps and form-abandonment analytics where the conversion event is. This is the layer the vendor pitch silently substitutes the platform for; it is also the layer where the variants come from.

Two — a test-design discipline that picks one hypothesis at a time. Set a primary success metric and a guardrail metric. Decide up front what changes after the test — copy, layout, removal, addition — and what the team will do differently under each outcome. Variants come from the research layer above, not from a brainstorm. Vermeer's "Experimentation Gap" framing is the cleanest read on what happens when this layer is missing: the team accumulates tests, not learnings.
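One way to make this layer hard to skip is to write the plan down as a record that refuses to exist until every outcome has a pre-declared action. The sketch below is hypothetical: the `ExperimentPlan` class and its field names are illustrative, and the example values borrow the subscription-framing scenario from earlier in this piece.

```python
from dataclasses import dataclass

OUTCOMES = ("win", "loss", "inconclusive")
CHANGE_TYPES = ("copy", "layout", "removal", "addition")


@dataclass(frozen=True)
class ExperimentPlan:
    hypothesis: str        # one hypothesis per test
    research_source: str   # where in the research layer it came from
    primary_metric: str
    guardrail_metric: str
    change_type: str       # one of CHANGE_TYPES
    actions: dict          # outcome -> what the team does next

    def __post_init__(self):
        if self.change_type not in CHANGE_TYPES:
            raise ValueError(f"unknown change type: {self.change_type}")
        missing = [o for o in OUTCOMES if o not in self.actions]
        if missing:
            raise ValueError(f"no pre-declared action for: {missing}")


plan = ExperimentPlan(
    hypothesis="Price-per-month framing makes the PDP read as a subscription",
    research_source="customer interviews",
    primary_metric="PDP-to-cart rate",
    guardrail_metric="subscription opt-in rate",
    change_type="copy",
    actions={
        "win": "roll out the new copy and hold out the subscription cohort",
        "loss": "revert and re-interview",
        "inconclusive": "extend to the pre-declared maximum duration, then revert",
    },
)
```

A plan with no entry for "inconclusive" raises an error at creation time, before the test can start.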

Three — a statistical floor on what the team will call a result. Pre-declared sample size. Pre-declared significance threshold. Pre-declared test duration window. This is the layer auto-tuning vendors are pitching as obsolete. It is not. Ron Kohavi has been writing the canonical work on this for fifteen years (Trustworthy Online Controlled Experiments); the operator-tier methodology assumes its conclusions and the venture-backed vendor pitch quietly assumes its readers do not know they exist.
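A pre-declared sample size is a calculation, not a judgment call. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 5% significance level and 80% power are conventional defaults assumed here, not figures from this piece, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, target, alpha=0.05, power=0.80):
    """Sessions needed per arm for a two-sided, two-proportion z-test.

    baseline and target are conversion rates as fractions (0.0281 means 2.81%).
    alpha and power are assumed conventional defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2)


# The 2.81% baseline with a half-point absolute lift, as in the unit-economics section.
print(sample_size_per_arm(baseline=0.0281, target=0.0331))
```

Under these assumptions the half-point lift needs a little under 19,000 sessions per arm. Because the required sample grows with the inverse square of the lift, detecting a smaller lift takes far more traffic, which is the practical reason a low-traffic brand cannot afford to spend tests on weak hypotheses.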

Four — an automation layer that executes faster. Only here does the AI tooling layer earn its keep. Test deployment, variant generation against an already-researched hypothesis, lift detection against the pre-declared threshold. The vendors are at the right layer for their tooling. They are wrong about the layers above being optional.
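"Lift detection against the pre-declared threshold" can be as small as the check below. It is a sketch under stated assumptions: a pooled two-proportion z-test, a 5% threshold, and hypothetical function names and session counts. It is not a description of any vendor's detection logic.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rate (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def call_result(conv_a, n_a, conv_b, n_b, planned_n_per_arm, alpha=0.05):
    """Return 'keep running', 'lift detected', or 'no detectable lift'."""
    if min(n_a, n_b) < planned_n_per_arm:
        return "keep running"  # do not call a result before the planned sample size
    p = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    if p < alpha and conv_b / n_b > conv_a / n_a:
        return "lift detected"
    return "no detectable lift"


# Hypothetical counts: control converts 560 of 20,000 sessions, variant 660 of 20,000.
print(call_result(560, 20_000, 660, 20_000, planned_n_per_arm=19_000))
```

With these made-up counts the call returns "lift detected". The same counts at 5,000 sessions per arm would return "keep running", because the pre-declared sample size has not been reached. The automation executes the rule; the rule itself was set at layer three.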

This is what we install when a brand asks us to put AI on top of a CRO function. The install is not "we add the AI vendor." The install is the four-layer discipline above, with the AI tooling at layer four, and a research cadence at layer one the team will keep running after we leave.

Closing

The discipline the auto-tuning vendors are selling around is the discipline that earns the lift. The tool at layer four runs better when the work at layers one, two, and three has been done. It runs worse — and faster — when it hasn't. Faster bad guesses at scale is the failure mode of a stack that buys layer four and skips the rest.

