Achieving Statistical Significance with AdSense A/B Testing

Most publishers are aware that testing different templates against each other is one of the easiest ways to increase AdSense revenue. Knowing when you have actually stumbled on a winning template, however, is surprisingly complex.

Some publishers are tempted to pull out their statistics textbooks and start performing the calculations to determine the thresholds for various confidence levels. They start graphing the curves that illustrate the requisite margins of victory of one template over another.
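The textbook calculation those publishers reach for is typically a two-proportion z-test on click-through rates. Here is a minimal sketch of that calculation; the template labels and click/impression counts are hypothetical, chosen only to illustrate the mechanics:

```python
# Two-proportion z-test on CTR, the kind of textbook calculation
# a publisher might apply to A/B results. All numbers are made up.
from math import sqrt, erf

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Return the z-score and two-sided p-value for CTR(A) vs. CTR(B)."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: template A got 520 clicks on 40,000 views,
# template B got 470 clicks on 40,000 views.
z, p = two_proportion_z(520, 40_000, 470, 40_000)
```

In this made-up example the p-value comes out well above 0.05, i.e., the apparent winner is not yet statistically distinguishable from noise, and that is before the AdSense-specific noise sources below are even considered.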

Unfortunately, the AdSense system contains far more noise than a textbook model accounts for, which makes attaining statistical validity much more difficult. All of the following can contribute to the noise in the data you will get back from an A/B test with AdSense:

User behavior. Users browsing from work act differently than they do from home or from a public place. If it is a holiday, for example, then a higher percentage of your visitors might be visiting from their homes, and their ad-clicking behavior can be very different. Similarly, weekend behavior is often dramatically different from weekday behavior. For this reason, a test of an absolute minimum of a week is advised even for the most heavily trafficked websites.

Specific ad clicks. The particular ad that a user clicked on during a visit can throw off the results as well. Perhaps the easiest way to illustrate this is to imagine a site that attracted only two particular ads. One ad appeared only occasionally, but it paid $50 per click to the publisher. The other was displayed regularly and yielded $0.05 per click. In an A/B test, a single click on the $50 ad could throw off the results. Of course, this extreme probably never occurs, but nonetheless, some users occasionally click on very high-paying ads, and this can artificially inflate the apparent efficacy of a particular template.
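The arithmetic behind this illustration is worth seeing explicitly. Using the same hypothetical payouts, a single $50 click alongside ninety-nine $0.05 clicks dominates the average revenue per click:

```python
# Hypothetical illustration of one high-value click skewing revenue
# per click (RPC). Payouts match the $50 / $0.05 example in the text.
clicks = [0.05] * 99 + [50.00]   # 99 low-value clicks plus one $50 click

rpc_with = sum(clicks) / len(clicks)                 # includes the $50 click
rpc_without = sum(clicks[:-1]) / (len(clicks) - 1)   # excludes it
```

Here `rpc_without` is $0.05, while `rpc_with` is roughly $0.55, so the single outlier click inflates the template's apparent revenue per click more than tenfold.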

Advertisers. Every page impression on your site initiates an ad auction in which advertisers bid for publisher inventory. These auctions are very dynamic, and large advertisers can increase or decrease their bids or cancel their campaigns during a publisher's A/B test. In doing so, advertiser behavior can affect the revenue publishers earn from ad clicks. This effect is particularly pronounced for publishers that operate within a specific niche that attracts only a small handful of advertisers.

Inventory. Just as the advertisers are competing in an auction, publishers are also competing with other publishers' inventory. Although some advertisers have virtually unlimited budgets (i.e., those that have achieved positive ROI on their campaigns), many advertisers have limited budgets for their campaigns, and once their ads have been clicked a certain number of times in a given period, they stop appearing. If a competing publisher starts to attract more visitors to pages in a particular niche, the highest-paying ads may be diluted across more sites.

Smartpricing. In an effort to attract advertisers to its Content Network, Google introduced Smartpricing. Smartpricing discounts CPC payouts to publishers that Google's system has determined deliver fewer converting users. It is not clear how Smartpricing is applied. If it is applied evenly to an entire site, then it might not have a statistical impact on the results of an A/B test; if Google's Smartpricing system is more nuanced, then it might introduce even more noise into the system.

Interest-based ads. Google's algorithm for showing visitors interest-based ads instead of contextually relevant ads is also not transparent. Depending on how it is implemented, the use or non-use of interest-based ads on a pageview can affect click behavior and therefore the results of an A/B test.

Traffic sources. Site visitors come from a wide variety of sources: bookmarks, search engines, social media, email, etc. If, during a test, there is a change in a site's traffic sources, noise can be introduced. One example of traffic-source fluctuation comes from search engines. Search engines change their algorithms regularly, and new competitors in the SERPs (search engine results pages) can appear overnight; traffic from search engines can fluctuate during an A/B test and therefore undermine the validity of the results.

Randomizer. The randomizing script that chooses which template is served to which user is also subject to noise. If you were to flip a coin an even number of times, you might expect it to land on each side exactly half the time. In reality, however, this rarely happens, even over large numbers of flips. Statistics can account for this type of randomness, but since it is just one of many types of noise in the system, you need to do more than employ some statistical modeling. One possible approach to eliminating this facet of the noise is to serve alternate templates to each subsequent visitor. This way each template gets served precisely 50% of the time.
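The alternating-visitor approach can be sketched in a few lines. This is a simplified, in-process version: a real site would need to persist the counter (in a database or cache) across requests, and the template names here are hypothetical:

```python
# Sketch of deterministic alternation instead of random template choice.
# A shared counter cycles through the templates, so each one is served
# exactly as often as the others. In production the counter would need
# to be stored server-side across requests.
import itertools

TEMPLATES = ["A", "B"]   # hypothetical template identifiers
_counter = itertools.count()

def next_template():
    """Serve templates A and B alternately, guaranteeing a 50/50 split."""
    return TEMPLATES[next(_counter) % len(TEMPLATES)]

served = [next_template() for _ in range(10)]
```

After ten simulated visitors, each template has been served exactly five times, which a random coin flip would only approximate.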

Each of these factors separately creates a relatively small amount of noise, but in conjunction with each other, they are significant. Since each website will be affected differently, an astute publisher will determine the noise levels for their particular site.

The method we have developed to determine the noise levels for our sites is to run an A/B test where A equals B. In other words, we ran templates that were identical to each other in every way and assigned unique AdSense channels to the ads on both.1 This test should run for as long as possible. It can operate in the background since Google allows publishers to assign multiple channels for each ad impression, so a publisher can test non-template changes while this noise-level test is running. Note, however, that other template-based tests cannot run simultaneously with it.

We ran this test for several months and then analyzed the results. If there were no randomness or noise in the system, the results for the A and B templates would have been identical, since the templates themselves were identical. In reality, there were surprising variations. Using a spreadsheet, we isolated the greatest deviation within a single day and, shockingly, it was nearly 8% on a very large amount of traffic. In other words, if a publisher runs a test for a day and one template beats another by 5%, that does not necessarily mean it will deliver greater revenue over the long run.
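The spreadsheet analysis amounts to one calculation: for each day, the relative gap between the two channels' revenue, then the maximum over all days. A minimal sketch, with entirely made-up daily revenue figures for the two identical channels:

```python
# Hypothetical A/A-test data: daily revenue recorded under two channels
# that served identical templates. All figures are invented.
daily_revenue = [
    (101.2, 98.7),    # (channel_a, channel_b) for day 1
    (95.4, 102.9),    # day 2
    (110.0, 104.1),   # day 3
]

def max_daily_deviation(pairs):
    """Largest relative gap between the two channels on any single day."""
    return max(abs(a - b) / min(a, b) for a, b in pairs)

noise_floor = max_daily_deviation(daily_revenue)
```

With these invented numbers the worst single-day gap is just under 8%, comparable to the deviation described above; any one-day "win" smaller than a site's measured noise floor should be treated as inconclusive.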

With some further analysis, we found the maximum variations for any two consecutive days, any three consecutive days, and so on. We now use these values as the baseline for determining when we have achieved statistical validity.2
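Extending the single-day calculation to multi-day windows is a matter of summing each channel's revenue over every run of consecutive days before comparing. A sketch, again on invented A/A data:

```python
# Noise baselines for windows of 1..max_window consecutive days.
# Each channel's revenue is summed over the window before comparing.
# The daily figures are invented for illustration.
daily_revenue = [
    (101.2, 98.7),
    (95.4, 102.9),
    (110.0, 104.1),
]

def window_deviations(pairs, max_window):
    """Max relative channel deviation for each window length 1..max_window."""
    baselines = {}
    for w in range(1, max_window + 1):
        worst = 0.0
        for i in range(len(pairs) - w + 1):
            a = sum(p[0] for p in pairs[i:i + w])
            b = sum(p[1] for p in pairs[i:i + w])
            worst = max(worst, abs(a - b) / min(a, b))
        baselines[w] = worst
    return baselines

baselines = window_deviations(daily_revenue, 3)
```

On this toy data the baseline shrinks as the window grows, which is why longer tests can declare winners on smaller margins: a template only "wins" a w-day test if its margin exceeds the w-day noise baseline.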

How confident do you need to be before declaring a winner and moving on to the next A/B test? A publisher that waited for results that could be believed with 99.9% confidence would probably be wasting time on that test (since you can only test one batch of templates against each other at any given time). You might have to wait for weeks or months to achieve that level of confidence. So, if you have a long list of ideas for template tweaks, it is probably better to be satisfied with a lower confidence level when choosing winners. Once you are reasonably sure that you have a winner, incorporate it into the control template and start the next round of testing. After you run out of ideas, you can go back and retest particular aspects of your implementation to achieve higher confidence in their efficacy.

1 I should note that since we usually test 5 templates at any given time, we actually ran an A/B/C/D/E test to determine noise levels.

2 Thorough optimizers may want to chart the maximum variations for each of the values that AdSense provides (pageviews, CTR, CPC, RPM, revenue). The results are very surprising, but it would likely be unwise to lose focus on the metric that you are trying to maximize (most likely revenue).