
5 Common Incrementality Testing Mistakes (And How to Avoid Them)

Most incrementality tests fail because of avoidable mistakes. Here are the 5 most common errors and exactly how to fix each one.

Go Funnel Team · 6 min read

Incrementality testing only works when the test is set up correctly

The logic behind incrementality testing is sound: compare a group that sees ads to a group that doesn't, and measure the difference. But between that simple concept and a reliable result, there are several places where things go wrong.

We've reviewed over 200 incrementality tests across agency clients and in-house teams. More than half produced unreliable results due to one of five recurring mistakes. Here's what goes wrong and how to fix it.

Mistake 1: Running underpowered tests

This is the most common error. An underpowered test doesn't have enough data to detect a real effect, even when one exists. The result: your test shows "no significant lift," and you conclude the channel doesn't work -- when in reality, your test just couldn't see the lift.

How it happens

A brand spending $80K/month on a channel sets up an incrementality test with a 5% holdout for 2 weeks. The holdout group generates 47 conversions; the treatment group, roughly 19 times larger, generates just over 1,000. The treatment group's conversion rate comes out about 14% higher, but with only 47 holdout conversions, the confidence interval is so wide that the result isn't statistically significant.

The team concludes the channel has "no proven incrementality" and cuts the budget. In reality, the channel was probably driving a real 10-15% lift, but the test was too small to prove it.
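To make the problem concrete, here's a minimal sketch of the significance check on those example numbers, assuming hypothetical traffic of about 2,350 holdout visitors and 44,650 treatment visitors (a 5/95 split at roughly a 2% baseline conversion rate):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical traffic counts: a 5% / 95% split at roughly a 2% baseline conversion rate
holdout_conversions, holdout_visitors = 47, 2_350
treatment_conversions, treatment_visitors = 1_018, 44_650

# Two-proportion z-test: is the treatment conversion rate significantly higher than the holdout's?
z_stat, p_value = proportions_ztest(
    count=[treatment_conversions, holdout_conversions],
    nobs=[treatment_visitors, holdout_visitors],
    value=0,
    alternative="larger",
)

observed_lift = (treatment_conversions / treatment_visitors) / (holdout_conversions / holdout_visitors) - 1
print(f"Observed lift: {observed_lift:.0%}")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # p lands well above 0.05: a real ~14% lift goes undetected
```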

How to fix it

Run a power analysis before launching the test. You need to know:

  • Your expected baseline conversion rate (from the holdout)
  • The minimum lift you want to detect
  • Your desired significance level (0.05 standard) and power (0.80 minimum)

The power analysis tells you the sample size you need. If it says you need 200 holdout conversions and your test design will only produce 50, either increase the holdout percentage, extend the duration, or choose a higher-volume campaign to test first.

A practical minimum: aim for at least 100 conversions in the holdout group. Below that, even a 30% lift may not reach significance.
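If you'd rather not do the math by hand, here's a minimal sketch of that power analysis in Python using statsmodels. The 2% baseline rate, 15% minimum detectable lift, and 5% holdout are illustrative assumptions, not recommendations:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.02          # assumed holdout conversion rate
min_detectable_lift = 0.15    # smallest relative lift worth detecting
holdout_share = 0.05          # 5% holdout, 95% treatment

# Effect size for comparing two proportions (baseline vs. lifted rate)
effect = proportion_effectsize(baseline_rate * (1 + min_detectable_lift), baseline_rate)

# Solve for the holdout sample size at alpha = 0.05 and power = 0.80
holdout_n = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.80,
    ratio=(1 - holdout_share) / holdout_share,  # treatment group is 19x the holdout
    alternative="larger",
)

print(f"Holdout visitors needed: {holdout_n:,.0f}")
print(f"Expected holdout conversions: {holdout_n * baseline_rate:,.0f}")
```

If the required holdout is bigger than your planned design can deliver, adjust the holdout percentage, duration, or campaign choice before launch, not after.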

Mistake 2: Contaminating the holdout group

Holdout contamination means the "no ads" group actually saw some ads. When that happens, the holdout group's conversion rate is artificially inflated, which makes the measured lift smaller than the true lift.

How it happens

Geographic leakage: In geo tests, ad platforms don't perfectly enforce geographic boundaries. DSPs may serve ads to users who live in a holdout DMA but work in a treatment DMA (based on IP). Mobile geo-targeting can be inaccurate by 10-20 miles near DMA boundaries.

Cross-platform exposure: You're running an incrementality test on Meta, but your Google campaigns are still running everywhere. Users in the Meta holdout group are still seeing your Google ads, which can influence their conversion behavior.

Retargeting overlap: Users in the holdout group who previously visited your site may still receive retargeting ads from separate campaigns that weren't included in the test setup.

How to fix it

For user-level tests, use the platform's native tools (Meta Conversion Lift, Google Conversion Lift). They handle suppression correctly.

For geo tests:

  • Create a 25-mile buffer zone between treatment and holdout DMAs and exclude those buffer zones from analysis
  • Pause all paid media in holdout markets, not just the channel being tested (or at minimum, document which other channels are running)
  • Audit delivery reports daily during the test to catch leakage

For any test, document all other marketing activities running during the test period. If your email list includes holdout-market customers, note it. If you have national TV running, note it. These don't invalidate the test, but they should factor into your interpretation.
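The daily delivery audit is straightforward to script. A minimal sketch, assuming the platform exports a delivery report with hypothetical date, dma, and impressions columns:

```python
import pandas as pd

# Hypothetical export from the ad platform's delivery report
delivery = pd.read_csv("delivery_report.csv")  # assumed columns: date, dma, impressions

holdout_dmas = {"Boise, ID", "Spokane, WA", "Tucson (Sierra Vista), AZ"}  # your holdout markets

# Any impressions served into holdout DMAs are contamination
leakage = delivery[delivery["dma"].isin(holdout_dmas)]
leaked = leakage["impressions"].sum()
total = delivery["impressions"].sum()

if leaked:
    print(f"Leakage: {leaked:,} impressions ({leaked / total:.2%} of delivery) in holdout markets")
    print(leakage.groupby("dma")["impressions"].sum().sort_values(ascending=False))
else:
    print("No impressions delivered to holdout DMAs")
```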

Mistake 3: Testing during unstable periods

Incrementality tests assume that the only meaningful difference between treatment and holdout groups is the presence or absence of ads. When external factors differentially affect the groups, the results are compromised.

How it happens

Seasonal disruption: Running a test during Black Friday, Prime Day, or a major sale event. Consumer behavior during these periods is fundamentally different from normal periods. A test that shows 40% incremental lift during Black Friday may show 10% during a normal month.

Promotional overlap: Launching a new product, running a flash sale, or sending a major email blast during the test. These events boost conversion rates in both groups but can interact unpredictably with your ad exposure.

Competitive disruption: A competitor launches a major campaign or goes out of business during your test period, shifting market dynamics.

How to fix it

  • Run tests during "normal" business periods when possible
  • If you must test during a promotional period, plan for it -- extend the test to include both promotional and non-promotional days
  • Document every external event that occurs during the test and assess whether it could differentially affect treatment vs. holdout
  • Run a pre-test observation period (2-4 weeks) to verify that treatment and holdout groups track together before the test begins (a quick check is sketched after this list). If they diverge during the pre-test, your market matching or randomization needs work
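A minimal sketch of that pre-test check, assuming you have daily conversion counts for each group (the numbers below are hypothetical):

```python
import pandas as pd

# Hypothetical daily conversions during a two-week pre-test observation window
pre = pd.DataFrame({
    "treatment": [118, 124, 131, 110, 140, 152, 119, 121, 127, 133, 112, 143, 155, 120],
    "holdout":   [6, 7, 6, 5, 8, 8, 6, 6, 7, 7, 5, 8, 9, 6],
})

# Scale the holdout up to the treatment group's size so the series are comparable
scaled_holdout = pre["holdout"] * (pre["treatment"].sum() / pre["holdout"].sum())

# Two quick diagnostics: do the daily series move together, and how far apart do they drift?
correlation = pre["treatment"].corr(scaled_holdout)
max_daily_gap = ((pre["treatment"] - scaled_holdout).abs() / pre["treatment"]).max()

print(f"Daily correlation: {correlation:.2f}")    # want this high (e.g. above 0.8)
print(f"Largest daily gap: {max_daily_gap:.1%}")  # want this small and not trending in one direction
```

If correlation is low or the gap drifts steadily in one direction, rework the matching or randomization before launching the real test.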

Mistake 4: Ending the test too early

Also known as "peeking." This happens when someone checks the results halfway through, sees an interesting number, and either stops the test or makes decisions before reaching the planned duration.

How it happens

Day 5 of a 14-day test. The data shows a 22% lift with 92% confidence. The team gets excited: "We don't need to wait two more weeks -- the results are already significant!"

The problem: statistical significance calculated on partial data is unreliable. When you check results multiple times during a test, each check increases the probability of a false positive. Running 7 interim checks on a 14-day test at the 95% significance level gives you an actual false positive rate of roughly 20%, not 5%.

This is the multiple comparisons problem, and it catches even experienced data scientists.
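A quick simulation makes the inflation concrete. The sketch below assumes there is no true lift at all (with hypothetical daily volumes), simulates a 14-day test, and counts how often at least one of 7 interim checks crosses the one-sided 95% threshold purely by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

simulations, rate = 10_000, 0.02
daily_holdout_n, daily_treatment_n = 500, 9_500  # assumed daily visitors; no true lift in either group

false_positives = 0
for _ in range(simulations):
    h_conv = t_conv = h_n = t_n = 0
    for day in range(1, 15):
        h_conv += rng.binomial(daily_holdout_n, rate)
        t_conv += rng.binomial(daily_treatment_n, rate)
        h_n += daily_holdout_n
        t_n += daily_treatment_n
        if day % 2 == 0:  # "peek" every other day: 7 interim checks over 14 days
            p_pool = (h_conv + t_conv) / (h_n + t_n)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / h_n + 1 / t_n))
            z = (t_conv / t_n - h_conv / h_n) / se
            if z > 1.645:  # crosses the one-sided 95% threshold on this peek
                false_positives += 1
                break

print(f"False positive rate with 7 peeks: {false_positives / simulations:.1%}")  # far above the nominal 5%
```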

How to fix it

  • Set the test duration before launching and commit to it
  • Don't look at results before the planned end date
  • If you must monitor the test (for quality assurance, not for decision-making), use a sequential testing framework that adjusts for multiple looks. These are available in most A/B testing platforms but need to be applied to incrementality tests as well
  • Pre-commit to your analysis plan: what metric, what significance threshold, what minimum sample size. Write it down before the test starts

Mistake 5: Testing the wrong thing

The test runs perfectly. Statistical power is adequate. No contamination. Full duration. Significant results. But the test answered a question you didn't actually need answered.

How it happens

Testing at too high a level: You test "does Meta work?" and find a 20% incremental lift. Useful, but it doesn't tell you whether prospecting, retargeting, or brand campaigns on Meta are driving the lift. All three might have very different incrementality profiles.

Testing at too low a level: You test whether one specific ad creative is incremental. It shows no significant lift. But the creative was only running to a small audience, so the test couldn't detect a meaningful effect. You needed to test the campaign, not the creative.

Testing a channel you already know works: Some teams test their highest-ROAS channel first because they're confident it will validate their strategy. But the channels most worth testing are the ones where you have the most uncertainty -- typically retargeting, branded search, and long-running "always on" campaigns that nobody questions.

How to fix it

Before designing a test, write down the specific business decision it will inform. Examples:

  • "If Meta prospecting shows less than 1.5x incremental ROAS, we will reallocate $30K/month to TikTok"
  • "If retargeting shows less than 50% of platform-reported conversions are incremental, we will reduce retargeting budget by 40%"
  • "If branded search shows less than 20% of clicks are incremental, we will reduce bids by 50%"

The decision defines the test. If you can't articulate what you'll do differently based on the result, don't run the test.
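As an illustration, here's how the first decision rule above might be evaluated once the test ends. Only the 1.5x threshold and $30K reallocation come from the example; the revenue and spend figures are hypothetical placeholders:

```python
# Hypothetical readout for "Meta prospecting < 1.5x incremental ROAS -> reallocate $30K/month to TikTok"
treatment_revenue = 412_000   # revenue from the treatment group during the test
holdout_revenue = 18_500      # revenue from the holdout group
holdout_share = 0.05          # 5% holdout
test_spend = 96_000           # ad spend against the treatment group during the test

# Scale the holdout to the treatment group's size to estimate the "no ads" counterfactual
counterfactual_revenue = holdout_revenue * (1 - holdout_share) / holdout_share
incremental_revenue = treatment_revenue - counterfactual_revenue
incremental_roas = incremental_revenue / test_spend

print(f"Incremental ROAS: {incremental_roas:.2f}x")
if incremental_roas < 1.5:
    print("Below threshold: reallocate $30K/month to TikTok per the pre-committed plan")
else:
    print("At or above threshold: keep the budget in Meta prospecting")
```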

The meta-lesson

Every mistake on this list comes from the same root cause: treating incrementality testing as a simple checkbox instead of a rigorous experimental design exercise. The concept is simple. The execution requires care. Agencies that build repeatable testing processes -- with pre-test checklists, power analyses, and pre-committed analysis plans -- produce reliable results. Those that wing it produce data that's no more trustworthy than the platform ROAS numbers they were trying to validate.

Frequently Asked Questions

How do I know if my past incrementality test results were reliable?

Review four things: Was the holdout group large enough (100+ conversions)? Did the pre-test period show parallel trends between treatment and holdout groups? Were there any external events that could have skewed results? Was the test duration determined in advance or was it stopped early? If any of these checks fail, treat the results as directional at best. Re-run the test with proper controls before making major budget decisions based on old data.

Should I hire a specialist to run incrementality tests?

For your first few tests, working with someone who has experimental design experience is valuable. The test setup is where most errors occur, and an experienced analyst will catch issues (underpowered designs, contamination risks, confounding variables) that aren't obvious to a media buyer seeing this for the first time. After running 3-4 tests with guidance, most teams can handle the process independently. The analytical step -- calculating lift and confidence intervals -- is straightforward with basic statistics knowledge or free online tools.

What do I do when incrementality and attribution data contradict each other?

Trust the incrementality data for the specific question it answers (causal impact) and trust the attribution data for the questions it answers (relative performance within a channel). If incrementality shows a channel delivers 50% of the conversions attribution claims, apply a 0.5 calibration factor to that channel's attributed conversions going forward. This is not a contradiction -- it's two different lenses on the same activity. The incrementality test is saying "half the people attribution credits to this channel would have converted anyway." Use the calibrated attribution data for ongoing optimization and re-test incrementality quarterly to keep the calibration current.
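A minimal sketch of that calibration step, with hypothetical attributed conversion counts and per-channel factors from past tests:

```python
# Platform-attributed conversions for the month (hypothetical numbers)
attributed = {"meta_prospecting": 1_840, "meta_retargeting": 960, "branded_search": 1_120}

# Calibration factors from each channel's most recent incrementality test
# (e.g. 0.5 means the test found half of the attributed conversions were truly incremental)
calibration = {"meta_prospecting": 0.85, "meta_retargeting": 0.50, "branded_search": 0.30}

calibrated = {
    channel: round(conversions * calibration.get(channel, 1.0))
    for channel, conversions in attributed.items()
}
print(calibrated)  # use these for budget decisions; re-test quarterly to refresh the factors
```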


Go Funnel uses server-side tracking and multi-touch attribution to show you which ads actually drive revenue. Book a call to see your real numbers.

Want to see your real ROAS?

Connect your ad accounts in 15 minutes and get attribution data you can actually trust.

Book a Call
