Your AI conversion tool now writes the variant. That changes what the test is measuring.

By the Founder & Growth Lead

You open the conversion tool on Monday, type one sentence about the page you want to lift, and go get coffee. By the time you're back, it has written forty headlines, a dozen CTAs, three new hero sections — and it's already showing them to live traffic, quietly keeping score. Nobody on your team wrote a word of it. The pitch on the box was exactly this: "Auto-generate headlines, CTAs, and full-page content." That's Coframe, and it's not alone. It feels like the cleanest leverage you've bought all year.

It's also the moment your A/B test stopped doing the job you think it's doing. When the same tool writes the variants and grades them, the test quietly turns from a question you asked into a machine picking its own favorite — and you ship the favorite because it "won." This piece is about what that costs, and the three things to lock before you turn the tool on.

Two kinds of AI conversion tool, and the pitch hides which one you bought

There's a split running through the whole category right now, and most buyers can't see it because every tool says "AI" on the homepage.

We read a dozen of them this week — the live homepages and product pages, quoted as they stand. Two camps fell out, and they are not doing the same thing:

  • The tools that write your variants. A newer, AI-native cohort: Coframe ("we generate copy, code, and visuals — you review and approve what goes live"), Instapage ("create headlines, page text, CTAs… for A/B testing in seconds"), and Fibr, another AI-native entrant pitching the same generate-and-test loop. You give a prompt; the model authors the copy and pushes it into the test.
  • The tools that run your variants. The established experimentation platforms (Optimizely, AB Tasty, Dynamic Yield, Webflow Optimize, OptiMonk), at least on the surfaces we checked, market their AI as an optimizer: it personalizes, it routes traffic, it picks audiences, it flags anomalies. A human still writes the copy. The machine grades it.

That second camp is the honest tell. If writing the variant were obviously a solved, safe thing to hand a model, the most sophisticated conversion-tooling companies in the world would be racing to advertise it. They mostly aren't. Even Unbounce's copy product calls itself an "AI copywriting assistant" that helps you "rewrite, expand, and summarize" — it refines a line you wrote, on purpose, rather than authoring the variant from scratch.

So letting the tool write your variants is a choice you're making, not the water you have to swim in. And like most choices that feel free, the bill is itemized somewhere you're not looking.

The self-graded winner

Here's what changes the instant one tool both writes the copy and runs the test that picks the winner.

A test is supposed to be a question with a referee. You write two answers, real traffic votes, the referee tells you which answer won. The referee's whole value is that it's neutral — it didn't write either answer, so when it raises a hand, you learn something.

When the same system writes all forty answers and also referees them, the referee isn't neutral anymore. It's grading its own homework, at scale, and handing you the top of its own stack with a number attached. The winning line didn't earn its way past your judgment. It earned its way past forty other strings the model guessed, which is a different and much lower bar.
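One statistical reading of "a much lower bar", which this piece does not spell out, is best-of-many selection: when forty variants are scored at once, the top one tends to look better than it really is. The sketch below is a toy simulation, not a model of any vendor's tool. Every number in it (40 variants, 1,000 visitors each, a 5% true conversion rate) is an assumption chosen for illustration, and the function name is hypothetical.

```python
import random


def simulate_best_of_n(n_variants=40, visitors=1000, true_rate=0.05, seed=7):
    """Simulate variants that all share the SAME true conversion rate.

    Returns (best observed rate, mean observed rate). Any gap between the
    two comes from sampling noise alone, because no variant is truly better.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    observed = []
    for _ in range(n_variants):
        # Each visitor converts independently with probability true_rate.
        conversions = sum(1 for _ in range(visitors) if rng.random() < true_rate)
        observed.append(conversions / visitors)
    return max(observed), sum(observed) / len(observed)


best, mean = simulate_best_of_n()
print(f"true rate 0.050 | mean observed {mean:.3f} | best observed {best:.3f}")
```

In this toy setup the "winner" beats the true rate even though every variant is identical underneath, so a green arrow on the top string is weaker evidence than the same arrow on one of two variants you chose to compare.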

Call it the self-graded winner: a machine-written line that ships not because anyone decided it was right, but because it beat the other machine-written lines, and "it won the test" is the cleanest-sounding reason in marketing to put something live. The data badge does the rest. A line you'd have killed in a review sails through, because it arrived wearing a green arrow instead of a first draft.

You didn't approve the copy. You approved the process. And the process now writes the thing it grades.

Whose question is the test even asking?

Strip the data badge away and a quieter loss shows up underneath: you stopped asking a question at all.

The point of a conversion test was never the lift. It was the reason for the lift — the thing you learned about your customer that you didn't know on Monday. That only exists if someone declared, before traffic split, what they believed: this segment is bouncing because the offer reads as risky, so we'll lead with the guarantee, and we expect add-to-carts to move. Win or lose, you learn whether your read of the customer was right.

A generate-and-test tool skips that. It didn't have a belief. It had a prompt and a sampler. "Which of my forty strings got the most clicks" is a real result, but it's not an answer to anything you asked — because you didn't ask. (This is the cousin of a trap we named in an earlier column: there, the tool wrote your hypothesis; here it writes the copy underneath one you never bothered to form. Same disappearance, one layer down.)

Forty headlines is plenty. What's missing is the one sentence above them — the belief the test was supposed to check.

A winner you can't read back

Say a variant wins clean — real traffic, real margin, no funny business. You still hit two costs the dashboard won't print.

  • You can't read why it won. A winning line a human wrote is legible: you know the bet, so the result teaches you something you carry into the next test. A winning line the model wrote tells you a string beat other strings. Was it the urgency? The reframed offer? A claim you can't actually keep? You don't know, because nobody formed the thought — so the win doesn't compound. It's a lift you bank once and can't reuse.
  • It optimizes for the click, not for how you sound. The model is pointed at one number, and your brand voice isn't in it. We've written before about AI marketing copy landing "roughly on-brand — also generic." The test makes that worse, not better: now the off-voice line doesn't just get drafted, it wins, and winning is the one credential that overrides a human's "that doesn't sound like us." The flattest, most aggressive phrasing often does test well for a week. Ship enough self-graded winners and the site slowly stops sounding like the company and starts sounding like a model chasing clicks — one green arrow at a time.

Neither of these shows up in the QBR, the quarterly business review where the vendor walks you through the numbers. The QBR shows the lift. It doesn't show what the lift quietly cost you: a win you can't reuse and a voice that drifts.

The flow — what to lock before you let it write

To be clear about the steelman: handing the build of variants to AI is the right call, and we've said so. A team hand-coding forty headlines in 2026 is paying artisan prices for commodity work. The research loop doesn't change shape because the tool got a writing hand. What changes is one boundary — and you set it before you turn the tool on, not after the first self-graded winner ships.

The classic flow. You (or your copywriter) wrote the variants. The tool split the traffic and named a winner. The copy was yours; the test merely graded it. Slow, but the referee was neutral and every winner was legible.

The AI-native version. Let the tool write — but lock three things first, so the test stays a test:

  • Write the hypothesis yourself; let the tool fill variants under it. Pre-declare the one paragraph — who's stuck, on what, what you're changing, what number should move. Then the generator can write its forty headlines, because now they're forty answers to your question, not forty guesses in place of one. The model brings the volume; you bring the belief it's testing.
  • Hand the generator a voice boundary it can't cross — not a suggestion. A short, hard list: the claims legal won't allow, the phrase the founder killed last quarter, the three things the brand never says, the register it has to stay inside. The model optimizes for the click and knows none of this unless it's written down where it reads before it writes. This is the locked-rules file doing its actual job.
  • Gate every winner before it becomes site canon. A green arrow is a candidate, not a decision. Twenty minutes: read the winning line, say out loud why it might have won (offer? urgency? a promise you can't keep?), check it against the voice boundary. The tool surfaces the winner; a human ships it. It's the same shape we use everywhere else AI proposes and a person signs: the tool surfaces, it never gates.
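The three locks above can be sketched as a small pre-flight check. This is a minimal illustration, assuming Python 3.9+ and a simple substring match for banned phrases. Every name here (`Hypothesis`, `VoiceBoundary`, `filter_variants`, `can_ship`) and every example phrase is hypothetical, not part of any tool named in this piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Lock 1: the belief, declared before any variant is generated."""
    who_is_stuck: str
    on_what: str
    change: str
    metric: str

    def is_declared(self) -> bool:
        # Every field must contain real text, not blanks.
        fields = (self.who_is_stuck, self.on_what, self.change, self.metric)
        return all(f.strip() for f in fields)


@dataclass(frozen=True)
class VoiceBoundary:
    """Lock 2: a hard list of phrases the brand never ships (assumed format)."""
    banned_phrases: tuple[str, ...]

    def violations(self, text: str) -> list[str]:
        lowered = text.lower()
        return [p for p in self.banned_phrases if p.lower() in lowered]


def filter_variants(hypothesis, boundary, variants):
    """Refuse to proceed without a hypothesis; drop variants that cross the boundary."""
    if not hypothesis.is_declared():
        raise ValueError("Declare the hypothesis before generating variants.")
    return [v for v in variants if not boundary.violations(v)]


def can_ship(winner, boundary, read_back_reason, approved_by):
    """Lock 3: a winner ships only with a written read-back and a named human."""
    has_reason = bool(read_back_reason.strip())
    has_owner = bool(approved_by.strip())
    return has_reason and has_owner and not boundary.violations(winner)


hypothesis = Hypothesis(
    who_is_stuck="first-time visitors on the product page",
    on_what="the offer reads as risky",
    change="lead with the guarantee",
    metric="add-to-cart rate",
)
boundary = VoiceBoundary(banned_phrases=("risk-free forever", "act now"))
generated = ["Try it with our 30-day guarantee", "Act now before it's gone"]

candidates = filter_variants(hypothesis, boundary, generated)
print(candidates)
print(can_ship(candidates[0], boundary, "Leads with the guarantee", "growth lead"))
```

A substring match is the crudest possible boundary and will miss paraphrases; the point of the sketch is the order of operations: belief first, boundary as a hard input, and a human read-back between "won" and "live".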

Where it breaks. Two ways. Refuse the generation entirely and you're paying by hand for work the machine now does for cents — that's the mirror-image mistake. Or keep the three locks as theater: a hypothesis nobody writes, a voice list nobody updates, a winner nobody reads before it auto-publishes. Locks you don't close aren't locks.

Install note. In our per-brand work this is one new check in the conversion folder, not a new tool: any machine-written variant carries the brand's voice boundary as a hard input, and no generated winner ships without the read-back. The generator is welcome. It just doesn't get to write and grade and publish unwatched.

Let it type. Keep the judgment.

Hand over the typing — that part really did get cheap. Just don't hand over the part where you decide what's true about your customer, because a test that writes its own answers will happily tell you that you were right about a question you never asked.
