Holdout Tests Explained: Measuring True Ad Impact
Holdout tests withhold ads from a control group to measure true incremental impact. Here's how they work and why every media buyer should run them.
A holdout test answers the question your ad platform never will
Every ad platform tells you how many conversions your ads "drove." None of them tell you how many of those conversions would have happened anyway. A holdout test answers that second question by deliberately withholding ads from a randomly selected control group and measuring the difference.
The concept comes from clinical trials. Medical researchers don't just give a drug to patients and see if they improve -- they compare to a placebo group. Holdout tests apply the same rigor to advertising. The "placebo group" sees no ads (or a neutral Public Service Announcement). The treatment group sees your actual campaigns. The difference in outcomes is your true incremental impact.
How holdout tests work mechanically
User-level holdouts
The platform randomly assigns users to either the treatment group (sees your ads) or the holdout group (doesn't see your ads). Both groups are tracked for the same conversion event over the same time period.
Meta's Conversion Lift does this natively. You set up the test in Experiments, define the conversion event, and Meta handles the randomization. The holdout group is suppressed from your ad delivery while remaining in the target audience for measurement purposes.
Google's Conversion Lift works similarly but requires a Google rep to set up. It creates a user-level holdout within your campaign targeting.
The advantage of user-level holdouts: perfect randomization eliminates selection bias. The disadvantage: they only work within a single platform.
Geographic holdouts
You select geographic regions (DMAs, states, metro areas) and pause all advertising in those regions. Conversions in holdout regions represent your organic baseline.
Geographic holdouts are the only method that works across all channels simultaneously. You can measure the combined incremental impact of your entire media portfolio, not just one platform.
The trade-off: geographic holdouts have more noise because markets differ in ways that affect conversion rates (demographics, competition, seasonality). Proper market matching minimizes this, but it's never as clean as user-level randomization.
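One common way to do that market matching is to rank candidate holdout regions by how closely their historical conversion series track the treatment region's. The sketch below uses made-up weekly conversion counts and hypothetical market names, not real data:

```python
# Sketch: rank candidate holdout markets by Pearson correlation with a
# treatment market's historical conversions. Series are illustrative only.

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

treatment = [120, 135, 128, 150, 142, 160]       # weekly conversions, e.g. Dallas
candidates = {
    "Houston": [118, 133, 125, 149, 140, 158],   # tracks the treatment market
    "Phoenix": [200, 180, 210, 170, 220, 165],   # moves on its own pattern
}

# Pick the candidate whose history best mirrors the treatment market.
best = max(candidates, key=lambda m: pearson(treatment, candidates[m]))
print(best)  # Houston
```

A well-matched control market won't remove noise entirely, but it shrinks the baseline differences that would otherwise masquerade as lift.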
Time-based holdouts
The simplest but least reliable method: pause ads entirely for a period, then compare to a period with ads running. No randomization, no control group -- just before/after comparison.
Time-based holdouts are a last resort. They can't distinguish between ad impact and other factors that change over time (seasonality, competitor activity, organic growth). Use them only when user-level and geographic methods aren't feasible, and interpret results cautiously.
Sizing your holdout group correctly
The trade-off between precision and lost revenue
A larger holdout gives you more statistical precision but costs you more in forgone conversions. The standard range is 5-20% of your audience.
5% holdout: Minimal revenue impact. Requires a large base of conversions (200+ per day) or a long test period (4+ weeks) to detect moderate effects.
10% holdout: The standard for most advertisers. Balances statistical power with revenue protection. Detects a 15-20% lift with 95% confidence over 2-3 weeks for campaigns with 50+ daily conversions.
15-20% holdout: Use for smaller campaigns or when you need to detect small effects (5-10% lift). The revenue impact is real -- a 20% holdout means 20% of your target audience doesn't see ads during the test.
Minimum conversion thresholds
The holdout group needs enough conversions to produce a reliable baseline rate. Rules of thumb:
- Minimum 50 conversions in the holdout for a directional read
- Minimum 100 conversions in the holdout for a decision-quality result
- Minimum 200 conversions in the holdout for precise measurement with tight confidence intervals
If your campaign generates 30 conversions per day and you use a 10% holdout, the holdout will see about 3 conversions per day. Over 14 days, that's only 42 holdout conversions -- not enough for a reliable result. Either increase the holdout to 20% (84 conversions) or extend the test to 28 days (84 conversions at 10%).
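The arithmetic above is simple enough to sanity-check in a few lines. This sketch reproduces the worked example and the two fixes:

```python
# Sketch: expected conversions in the holdout group for a planned test.
# Inputs are the example figures from the text; swap in your own.

def holdout_conversions(daily_conversions: float, holdout_pct: float, days: int) -> float:
    """Expected conversion count observed in the holdout over the test window."""
    return daily_conversions * holdout_pct * days

base = holdout_conversions(30, 0.10, 14)            # 42: below the 50-conversion floor
bigger_holdout = holdout_conversions(30, 0.20, 14)  # 84: decision-quality territory
longer_test = holdout_conversions(30, 0.10, 28)     # 84: same result, more calendar time
```

Either fix clears the 50-conversion minimum; which one you choose depends on whether calendar time or forgone revenue is scarcer.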
Running the test: what to watch for
Don't leak impressions to the holdout
In user-level tests, platform tools handle this automatically. In geographic tests, check delivery reports daily to confirm zero impressions are serving in holdout regions. Accidental leakage contaminates the test by partially treating the control group, which biases results toward zero lift.
Common leakage sources:
- Programmatic ads served based on IP ranges that cross DMA boundaries
- Social media ads that target "nearby" locations
- Connected TV platforms with imprecise geographic targeting
- Users traveling between treatment and holdout markets
Don't change anything mid-test
No creative refreshes. No bid strategy changes. No budget adjustments. No new campaigns. Any change during the test period creates ambiguity about what caused the observed effect. If you absolutely must make a change, note the exact date and time, and plan to analyze pre-change and post-change periods separately.
Monitor for external shocks
Events that differentially affect treatment and holdout groups can bias results. Track:
- Weather events in specific markets
- Local competitor promotions
- Regional news coverage
- Differences in promotional email targeting between regions
If an external shock occurs, you may need to extend the test or exclude the affected period from analysis.
Reading the results
The key metrics
Raw lift: Treatment conversion rate minus holdout conversion rate. This is the absolute incremental effect in percentage points.
Relative lift: (Treatment CR - Holdout CR) / Holdout CR. This expresses the lift as a percentage of the baseline. A 50% relative lift means your ads increased conversions by half over what would have happened organically.
Confidence interval: The range within which the true lift likely falls. A result of "25% lift with a 95% CI of 10-40%" means the lift is probably between 10% and 40%. If the confidence interval includes zero, the result is not statistically significant.
Incremental CPA: Your ad spend divided by only the incremental conversions. This is always higher than your platform-reported CPA because the denominator excludes the organic conversions the platform claims credit for -- same spend, fewer conversions.
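All four metrics fall out of the raw test counts. A minimal sketch, using illustrative numbers rather than real results:

```python
# Sketch: compute lift metrics from holdout test counts. All figures below
# are made up for illustration.

def lift_metrics(treat_conv, treat_users, hold_conv, hold_users, spend):
    treat_cr = treat_conv / treat_users
    hold_cr = hold_conv / hold_users
    raw_lift = treat_cr - hold_cr            # absolute lift in percentage points
    relative_lift = raw_lift / hold_cr       # lift as a share of the baseline
    # Incremental conversions: what treatment produced beyond the organic baseline.
    incremental = treat_conv - hold_cr * treat_users
    incremental_cpa = spend / incremental if incremental > 0 else float("inf")
    return raw_lift, relative_lift, incremental_cpa

raw, rel, icpa = lift_metrics(
    treat_conv=1200, treat_users=90_000,   # treatment: 1.33% conversion rate
    hold_conv=100, hold_users=10_000,      # holdout:   1.00% conversion rate
    spend=30_000,
)
# ~33% relative lift; 300 incremental conversions; $100 incremental CPA.
# Platform-reported CPA on the same spend would be $30,000 / 1,200 = $25.
```

Note the gap in the last two comment lines: $25 platform CPA versus $100 incremental CPA is exactly the kind of inflation the article describes.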
What "no significant lift" means
A non-significant result does not prove your ads don't work. It means one of three things:
- Your ads genuinely have no incremental impact on this audience
- The true lift is too small for your test to detect (a power problem)
- Your test duration was too short to accumulate enough data
Before concluding that a channel has zero incrementality, verify that your test had sufficient power (at least 80%) to detect a meaningful lift. If power was below 80%, the test was underpowered and you need to re-run with a larger holdout or longer duration.
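A rough power check only needs a baseline conversion rate, the smallest lift you care about, and the group sizes. This is a standard two-proportion z-test approximation, not a platform-specific formula, and the sizes are assumptions for illustration:

```python
# Sketch: approximate power of a two-sided two-proportion z-test for lift.
from math import sqrt
from statistics import NormalDist

def power(base_cr, rel_lift, n_treat, n_hold, alpha=0.05):
    """Probability of detecting a true relative lift of `rel_lift`."""
    p1 = base_cr                      # holdout (baseline) conversion rate
    p2 = base_cr * (1 + rel_lift)     # treatment rate if the lift is real
    se = sqrt(p1 * (1 - p1) / n_hold + p2 * (1 - p2) / n_treat)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf((p2 - p1) / se - z_crit)

# 1% baseline CR, 90k treated users, 10k held out (a 10% holdout).
power(0.01, 0.30, 90_000, 10_000)  # ~0.80: adequately powered for a 30% lift
power(0.01, 0.20, 90_000, 10_000)  # well under 0.80: a 20% lift may go undetected
```

At these sizes the test can reliably detect a 30% lift but not a 20% one, so a "no significant lift" result at 20% would be exactly the underpowered case the text warns about -- the fix is a larger holdout or a longer run.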
Holdout test results by channel: what to expect
Based on aggregated results from industry studies and our clients:
| Channel | Typical Platform ROAS | Typical Incremental ROAS | Inflation Factor |
|---------|----------------------|--------------------------|------------------|
| Retargeting | 8-15x | 0.5-2x | 5-10x |
| Branded search | 10-20x | 0.5-3x | 5-15x |
| Prospecting (Meta) | 2-5x | 1-3x | 1.5-3x |
| Prospecting (Google) | 2-4x | 1-2.5x | 1.5-2.5x |
| YouTube | 1-3x | 0.5-2x | 1.5-2x |
The pattern is consistent: platforms over-claim by anywhere from 1.5x to 15x depending on the channel. Retargeting and branded search show the largest gaps because they target high-intent users who would likely convert without the ad.
Frequently Asked Questions
How long should a holdout test run?
The minimum is 2 weeks for high-volume campaigns (100+ daily conversions) and 4-6 weeks for lower-volume campaigns (20-100 daily conversions). Always run the test for at least one full business cycle -- if your business has weekly patterns, run full weeks. If you have monthly patterns (B2B), you may need 6-8 weeks. The test should also extend beyond your typical purchase consideration window, so if customers typically take 10 days from first exposure to purchase, add those 10 days to your test duration.
Can I run a holdout test on retargeting without losing customers?
Yes, and this is one of the highest-value tests you can run. The holdout group represents only 10-15% of your retargeting audience, so the revenue impact is limited. More importantly, retargeting holdout tests consistently show that 50-85% of retargeted conversions would have happened without the retargeting ad. The customers already visited your site, added to cart, and demonstrated purchase intent. You may find that reducing retargeting spend by 50% and reallocating to prospecting produces a better total outcome.
What's the difference between a holdout test and a PSA test?
In a holdout test, the control group sees no ad at all. In a PSA (Public Service Announcement) test, the control group sees an irrelevant ad -- like a charity PSA -- in place of your ad. PSA tests are considered more rigorous because they control for the effect of seeing any ad (ad blindness, banner fatigue). However, PSA tests cost more because you're still paying for impressions to the control group. For most advertisers, a standard holdout test is sufficient. PSA tests are most valuable when testing brand awareness campaigns where the mere presence of an ad (any ad) in a particular placement might affect behavior.
Go Funnel uses server-side tracking and multi-touch attribution to show you which ads actually drive revenue. Book a call to see your real numbers.
Want to see your real ROAS?
Connect your ad accounts in 15 minutes and get attribution data you can actually trust.
Book a Call