
A/B Testing - Design, Analysis and Pitfalls
slavabo@gmail.com


Agenda
• Design the Experiment
  • Two main questions: how many users, and how long to run the test
  • Define a reasonable number of KPIs
  • Pay attention to seasonality and weekday effects
• Analyze the Experiment
  • Statistical methods for checking significance
  • Non-parametric methods
  • Outliers/bots/fraud
• Data-driven culture
• Pitfalls
• Open Discussion

Design
• Test Duration & Sample Size
  • Duration needs to be defined before the experiment is started!
  • Depends on the distribution of the main KPIs
    • 80% of KPIs have a Binomial distribution (conversion rate, CTR, etc.), and the CLT can help.
    • 20% are other types (count events, revenue).
  • Use power calculations to define N (sample size) and t (duration), OR use rules of thumb.
  • General rule: the smaller the difference you want to detect, the more data you'll need to collect.
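The general rule above can be made concrete for a binomial KPI. The following is a minimal sketch, assuming a two-sided two-proportion z-test and the standard normal-approximation sample-size formula; the function name, the 5% baseline rate, and the default alpha and power values are illustrative assumptions, not figures from the slides.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided two-proportion z-test.

    p_base   -- baseline conversion rate (assumed known from history)
    rel_lift -- smallest relative lift worth detecting, e.g. 0.05 for +5%
    """
    p_test = p_base * (1.0 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    # Sum of the two binomial variances (normal approximation via the CLT).
    var_sum = p_base * (1.0 - p_base) + p_test * (1.0 - p_test)
    n = (z_alpha + z_power) ** 2 * var_sum / (p_test - p_base) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical 5% baseline: smaller lifts need far more users.
    for lift in (0.10, 0.05, 0.02):
        n = sample_size_per_group(0.05, lift)
        print(f"relative lift {lift:.0%}: {n:,} users per group")
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum effect of interest should be fixed before the test starts.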

Design
• Example: number of searches per user (SweetIM)
  • Poisson assumption for count events
    • Not appropriate when variance >> mean
  • Negative Binomial (NB) was found appropriate
  • Power limitation of NB
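One quick way to see whether the Poisson assumption is reasonable for a count KPI is the variance-to-mean ratio (dispersion index). This is a rough sketch; the sample values below are invented for illustration and are not SweetIM data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, much larger when overdispersed."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; dispersion index is undefined")
    return variance(counts) / m


# Hypothetical searches-per-user sample (illustrative values only).
searches = [0, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 12, 18, 25, 40, 95]

if __name__ == "__main__":
    print(f"mean = {mean(searches):.2f}")
    print(f"dispersion index = {dispersion_index(searches):.1f}")
```

A ratio far above 1 suggests that a Poisson model will understate the variance, and an overdispersed model such as NB is worth trying.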

Statistical Power and Sensitivity
Sample size as a function of sensitivity (rows) and statistical power (columns)

Sensitivity        50%        55%        60%        65%        70%        75%        80%        85%        90%        95%
0.5%         1,001,243  1,133,747  1,276,816  1,433,622  1,608,696  1,808,942  2,045,743  2,340,142  2,738,670  3,386,960
1.0%           251,556    284,847    320,792    360,188    404,175    454,485    513,980    587,946    688,074    850,953
1.5%           112,359    127,228    143,284    160,880    180,527    202,998    229,572    262,609    307,332    380,083
2.0%            63,516     71,922     80,998     90,945    102,052    114,755    129,777    148,453    173,734    214,860
2.5%            40,853     46,259     52,097     58,494     65,638     73,808     83,470     95,482    111,743    138,194
3.0%            28,511     32,284     36,358     40,823     45,809     51,511     58,254     66,637     77,985     96,446
3.5%            21,051     23,837     26,845     30,142     33,823     38,033     43,011     49,201     57,580     71,210
4.0%            16,197     18,341     20,655     23,192     26,024     29,264     33,094     37,857     44,304     54,791
4.5%            12,861     14,564     16,401     18,416     20,665     23,237     26,279     30,060     35,180     43,507
5.0%            10,470     11,855     13,351     14,991     16,821     18,915     21,391     24,470     28,637     35,416
5.5%             8,696      9,846     11,089     12,451     13,971     15,710     17,767     20,324     23,785     29,415
6.0%             7,343      8,315      9,364     10,514     11,798     13,267     15,003     17,162     20,085     24,839
6.5%             6,288      7,120      8,018      9,003     10,103     11,360     12,847     14,696     17,199     21,270
7.0%             5,449      6,170      6,948      7,801      8,754      9,844     11,132     12,735     14,903     18,431
7.5%             4,770      5,401      6,083      6,830      7,664      8,618      9,746     11,148     13,047     16,135
8.0%             4,213      4,771      5,373      6,032      6,769      7,612      8,608      9,847     11,524     14,252
8.5%             3,750      4,247      4,783      5,370      6,026      6,776      7,663      8,766     10,259     12,687
9.0%             3,362      3,807      4,287      4,814      5,402      6,074      6,869      7,858      9,196     11,373
9.5%             3,032      3,434      3,867      4,342      4,872      5,478      6,196      7,087      8,294     10,258
10.0%            2,750      3,114      3,507      3,938      4,419      4,969      5,619      6,428      7,523      9,303

Negative Binomial parameter ? = 0.31, average and length of the test ? = 30, ? = 0.69

Design
• Define a Reasonable Number of KPIs
  • It's impossible to draw a conclusion from 20 KPIs
  • Project your KPIs onto the main business (lead) indicators
  • Consider weighted KPIs or a GPI (General Performance Indicator)
• Seasonality
  • Weekends may show different user behavior than weekdays
  • Holidays can be unpredictable
  • 7-day rule of thumb
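One possible reading of the 7-day rule of thumb is that a test should cover whole weeks, so that every weekday and weekend day is represented equally. The sketch below follows that reading; the function name and the traffic figures are hypothetical.

```python
from math import ceil


def duration_in_days(users_needed_total, users_per_day, cycle_days=7):
    """Days needed to reach the sample size, rounded up to whole weekly cycles."""
    if users_per_day <= 0:
        raise ValueError("users_per_day must be positive")
    raw_days = ceil(users_needed_total / users_per_day)
    full_cycles = max(1, ceil(raw_days / cycle_days))
    return full_cycles * cycle_days


if __name__ == "__main__":
    # Hypothetical: 62,000 users needed in total, 6,500 eligible users per day.
    print(duration_in_days(62_000, 6_500), "days")
```

Under this reading, a test that would reach its sample size in 10 days still runs for 14, and a test that needs only one day of traffic still runs for a full week.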

Analysis
• Statistical Parametric Methods
• Non-Parametric Methods
• Permutation Tests
• Outliers/Bots/Fraud

Analysis
• Statistical Parametric Methods
  • Use confidence intervals based on the KPI distribution
  • T-test, Chi-square test, etc. will work, but...
    • The t-test assumes a normal distribution of the statistic
    • Chi-square can be weak when low frequencies are observed
  • Try hypothesis testing based on the KPI distribution: it's not simple, but worth it
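For a binomial KPI such as conversion rate, a confidence interval and significance test based on the KPI distribution can be sketched as follows. This assumes large samples (normal approximation) and uses a pooled z-test with a Wald interval; the function name and the example counts are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def compare_conversion(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test and Wald confidence interval for p_b - p_a.

    Returns (difference, (ci_low, ci_high), p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test (null hypothesis: equal rates).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval around the observed difference.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


if __name__ == "__main__":
    # Hypothetical counts: 500/10,000 conversions in A, 560/10,000 in B.
    diff, (low, high), p = compare_conversion(500, 10_000, 560, 10_000)
    print(f"difference = {diff:.4f}, 95% CI = ({low:.4f}, {high:.4f}), p = {p:.3f}")
```

With very low conversion counts the normal approximation becomes unreliable, which is the same weakness the slide notes for the Chi-square test.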

• The Negative Binomial (NB) distribution can be used as a generalization of Poisson in overdispersed cases (Var >> Mean).
• It has been used before in other domains to analyze count data (genetics, traffic modeling).
• It fits the real distribution well.

[Figure: frequency vs. number of searches; real data compared with fitted NB and fitted Poisson]
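A fit comparison of this kind can be sketched with a method-of-moments NB fit and a log-likelihood comparison against Poisson. The parametrization (mean mu and size r, with Var = mu + mu^2 / r), the fitting method, and the sample data are assumptions made for illustration; the slides do not say how the NB was fitted.

```python
from math import exp, lgamma, log
from statistics import mean, variance


def fit_nb_moments(counts):
    """Method-of-moments NB fit: returns (mu, r) with Var = mu + mu**2 / r."""
    mu, var = mean(counts), variance(counts)
    if var <= mu:
        raise ValueError("no overdispersion: a Poisson model may be adequate")
    return mu, mu * mu / (var - mu)


def nb_log_pmf(k, mu, r):
    """Log-probability of count k under NB with mean mu and size r."""
    return (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
            + r * log(r / (r + mu)) + k * log(mu / (r + mu)))


def poisson_log_pmf(k, mu):
    """Log-probability of count k under Poisson with mean mu."""
    return k * log(mu) - mu - lgamma(k + 1)


def log_likelihoods(counts):
    """Return (NB log-likelihood, Poisson log-likelihood) for the same data."""
    mu, r = fit_nb_moments(counts)
    ll_nb = sum(nb_log_pmf(k, mu, r) for k in counts)
    ll_pois = sum(poisson_log_pmf(k, mu) for k in counts)
    return ll_nb, ll_pois


# Hypothetical searches-per-user sample (illustrative values only).
searches = [0, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 12, 18, 25, 40, 95]

if __name__ == "__main__":
    ll_nb, ll_pois = log_likelihoods(searches)
    print(f"log-likelihood NB = {ll_nb:.1f}, Poisson = {ll_pois:.1f}")
```

A higher (less negative) log-likelihood indicates the better fit; on heavy-tailed count data the NB usually wins by a wide margin because the Poisson cannot accommodate the long right tail.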

Analysis
• Non-parametric tests
  • When it's hard to estimate the distribution
  • As a sanity check (QA) on parametric tests
  • Mann-Whitney, Kolmogorov-Smirnov
• Pros:
  • Can be appropriate for unknown or non-Normal distributions
  • More robust than the t-test
• Cons:
  • Less sensitive and less powerful than parametric tests (median as a parameter)
  • Assume that both samples come from the same distribution
  • Assume a normal distribution in large samples
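To show what a rank-based test does, here is a small pure-Python sketch of the Mann-Whitney U test. It uses the large-sample normal approximation with a tie correction and no continuity correction, so it is only a rough guide for small samples; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample_a, approximate p-value).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0  # all values identical: no evidence of a difference
    z = (u_a - mean_u) / sqrt(var_u)
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical searches-per-user samples for groups A and B.
    a = [0, 1, 1, 2, 3, 3, 5, 8, 40]
    b = [1, 2, 4, 4, 6, 7, 9, 12, 95]
    u, p = mann_whitney_u(a, b)
    print(f"U = {u}, p = {p:.3f}")
```

Because only ranks enter the statistic, the single extreme value in each hypothetical sample has no more influence than any other largest observation.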

Analysis
• Permutation tests
  1. Calculate the test statistic
  2. Shuffle the pooled data and resample two random groups
  3. Calculate the test statistic again
  4. Compare it to your original statistic; if it is more extreme, increment k (k += 1)
  5. Repeat from step 2, N times in total
  6. Calculate the probability of getting a result more extreme than your original: k/N. This is your p-value.
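The steps above can be sketched as follows, using the absolute difference in means as the test statistic. The choice of statistic, the fixed random seed, and the use of "at least as extreme" (>=) rather than strictly more extreme are assumptions of this sketch.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means; returns the p-value."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    # Step 1: test statistic on the original grouping.
    observed = abs(mean(group_b) - mean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    k = 0
    for _ in range(n_permutations):  # step 5: repeat N times
        rng.shuffle(pooled)  # step 2: shuffle and split into two random groups
        stat = abs(mean(pooled[n_a:]) - mean(pooled[:n_a]))  # step 3
        if stat >= observed:  # step 4: count results at least as extreme
            k += 1
    return k / n_permutations  # step 6: p-value


if __name__ == "__main__":
    # Hypothetical per-user revenue for two groups.
    a = [0, 0, 3, 5, 5, 8, 9, 12, 15, 20]
    b = [0, 4, 6, 9, 9, 11, 14, 18, 22, 31]
    print(f"p-value = {permutation_test(a, b):.3f}")
```

No distributional assumption is needed, only that users are exchangeable between groups under the null hypothesis; the cost is computation time, which grows with both N and the sample size.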

Analysis
• Check for outliers
  • Plot your data at the daily/hourly level
  • Descriptive statistics can help (variance)
• Try to filter bots and crawlers
  • It is almost impossible to filter all non-human activity on the web.
  • Automatic bots and crawlers can bias the results and lead to wrong conclusions.
• Run a continuous A/A test as a sanity check for the whole system
  • What difference do you observe between the A groups, and is it insignificant?
• Technical and tracking issues
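As a starting point for the outlier check, descriptive statistics can flag suspicious values before any significance test is run. The sketch below uses the common 1.5 x IQR fence, which is one convention among several and not something the slides prescribe; the revenue values are invented.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical revenue per user; the last value could be a bot or a tracking bug.
revenue = [0, 0, 0, 5, 7, 8, 9, 10, 12, 15, 4000]

if __name__ == "__main__":
    print("flagged:", iqr_outliers(revenue))
    print(f"mean = {mean(revenue):.1f}, median = {median(revenue)}")
    cleaned = [v for v in revenue if v not in iqr_outliers(revenue)]
    print(f"mean without flagged values = {mean(cleaned):.1f}")
```

Flagged values deserve investigation rather than automatic deletion: a large order from a real customer is a legitimate data point, while the same number produced by a crawler is not.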

Data-Driven Culture
• Avoid the HiPPO (Highest Paid Person's Opinion) when it is not supported by data
• Be clear about your KPIs and how they affect your business
• Fight your ego: numbers don't lie
• 80%-90% of tests won't give a positive result
• Learn from failed tests

Pitfalls
• Picking an easy-to-beat KPI that is unrelated to lead business metrics
  • Example: focusing on increasing click-through rate for banners/buttons while ignoring other metrics such as user retention or revenue.
• Using incorrect statistical methods or violating their assumptions
  • Example 1: assuming that a KPI has a Normal distribution without actually checking it.
  • Example 2: using online significance calculators without understanding the data distribution.

Pitfalls
• Combining ratios from different proportions over time: Simpson's Paradox
  • Example:
• Ignoring outliers and bots, or not plotting data on a timeline
  • Example: one outlier can change the test results
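Simpson's Paradox is easiest to see with numbers. In the made-up data below, the traffic split between groups changes from day 1 to day 2; group B converts better on each day, yet looks worse when the two days are pooled. All figures are hypothetical.

```python
# (users, conversions) per day and group; hypothetical numbers.
days = {
    "day 1": {"A": (10_000, 1_000), "B": (1_000, 110)},
    "day 2": {"A": (1_000, 20), "B": (10_000, 300)},
}


def daily_rate(day, group):
    """Conversion rate of one group on one day."""
    users, conversions = days[day][group]
    return conversions / users


def pooled_rate(group):
    """Conversion rate of one group with all days added together."""
    users = sum(d[group][0] for d in days.values())
    conversions = sum(d[group][1] for d in days.values())
    return conversions / users


if __name__ == "__main__":
    for day in days:
        print(f"{day}: A = {daily_rate(day, 'A'):.1%}, B = {daily_rate(day, 'B'):.1%}")
    print(f"pooled: A = {pooled_rate('A'):.1%}, B = {pooled_rate('B'):.1%}")
```

The reversal comes from mixing a high-converting day and a low-converting day in different proportions for each group, which is why the traffic split should stay constant for the whole test.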

Pitfalls
• Starting a test without validation (an A/A test is a solution)
• Changing the control group during the test (solution: change both groups!)
• Technical issues with the experiment group
  • Example: redirects, cache, new technology
• Running your experiment "until it reaches a significant difference"
• Not "anchoring" users to one group only (also cookie problems)
Reference
• How Not To Run An A/B Test
  • http://www.evanmiller.org/how-not-to-run-an-ab-test.html
• Microsoft Experimentation Platform
  • http://www.exp-platform.com/Pages/ExPpitfalls.aspx
• Simpson's Paradox
  • http://vudlab.com/simpsons/
