A/B Testing - Design, Analysis and Pitfalls
slavabo@gmail.com
Business Package Test
Additional Advertising Test
Agenda
• Design the Experiment
  • 2 main questions: how many users and how long to run the test
  • Define a reasonable number of KPIs
  • Pay attention to seasonality/weekday effects
• Analyze the Experiment
  • Statistical methods for checking significance
  • Non-parametric methods
  • Outliers/bots/fraud
• Data-driven culture
• Pitfalls
• Open Discussion
Design
• Test Duration & Sample Size
  • Duration needs to be defined before the experiment starts!
  • Depends on the distribution of the main KPIs
    • 80% of KPIs have a Binomial distribution (Conversion Rate, CTR, etc.); the CLT can help.
    • 20% are other types (count events, revenue).
  • Use power calculations to define N (sample size) and t (duration), OR use rules of thumb.
  • General rule: the smaller the difference you want to detect, the more data you'll need to collect.
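As an illustration of the power calculation mentioned above, here is a minimal sketch for a binomial KPI such as conversion rate. It uses the textbook two-sided, two-proportion z-test approximation (which leans on the CLT); the function name, the default alpha and power, and the example rates are assumptions for illustration, not values from the talk.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate users per group needed to detect a relative lift.

    p_base   -- baseline conversion rate (e.g. 0.05)
    rel_lift -- smallest relative change worth detecting (e.g. 0.10 = 10%)
    """
    p_new = p_base * (1 + rel_lift)
    p_bar = (p_base + p_new) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    )
    return ceil((numerator / (p_new - p_base)) ** 2)


if __name__ == "__main__":
    # Halving the detectable lift roughly quadruples the required sample.
    for lift in (0.10, 0.05):
        print(f"lift {lift:.0%}: {sample_size_per_group(0.05, lift)} users per group")
```

The required sample grows roughly with the inverse square of the difference, which is the general rule stated on the slide.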
Design
• Example: number of searches per user (SweetIM)
  • Poisson assumption for count events
    • Not appropriate when variance >> mean
  • Negative Binomial (NB) was found appropriate
  • Power limitation of NB
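A quick way to see whether the Poisson assumption is plausible is to compare the sample variance with the sample mean. The sketch below does that and, as an assumption on my part, also estimates an NB dispersion parameter by the method of moments under the common parameterization Var = mean + alpha * mean^2; the talk does not say which parameterization or fitting method was used. The counts are made up for illustration.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Compare variance to mean for count data (e.g. searches per user)."""
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    if m == 0:
        raise ValueError("all counts are zero; dispersion is undefined")
    return {
        "mean": m,
        "variance": v,
        # Close to 1 is consistent with Poisson; far above 1 means overdispersion.
        "variance_to_mean": v / m,
        # Method-of-moments estimate, assuming Var = mean + alpha * mean**2.
        "nb_alpha": max(0.0, (v - m) / m**2),
    }


if __name__ == "__main__":
    searches_per_user = [0, 0, 1, 1, 2, 3, 5, 8, 13, 40]  # hypothetical data
    print(dispersion_summary(searches_per_user))
```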
Statistical Power and Sensitivity
Rows: sensitivity. Columns: statistical power. Cells: required sample size.

Sensitivity        50%        55%        60%        65%        70%        75%        80%        85%        90%        95%
0.5%         1,001,243  1,133,747  1,276,816  1,433,622  1,608,696  1,808,942  2,045,743  2,340,142  2,738,670  3,386,960
1.0%           251,556    284,847    320,792    360,188    404,175    454,485    513,980    587,946    688,074    850,953
1.5%           112,359    127,228    143,284    160,880    180,527    202,998    229,572    262,609    307,332    380,083
2.0%            63,516     71,922     80,998     90,945    102,052    114,755    129,777    148,453    173,734    214,860
2.5%            40,853     46,259     52,097     58,494     65,638     73,808     83,470     95,482    111,743    138,194
3.0%            28,511     32,284     36,358     40,823     45,809     51,511     58,254     66,637     77,985     96,446
3.5%            21,051     23,837     26,845     30,142     33,823     38,033     43,011     49,201     57,580     71,210
4.0%            16,197     18,341     20,655     23,192     26,024     29,264     33,094     37,857     44,304     54,791
4.5%            12,861     14,564     16,401     18,416     20,665     23,237     26,279     30,060     35,180     43,507
5.0%            10,470     11,855     13,351     14,991     16,821     18,915     21,391     24,470     28,637     35,416
5.5%             8,696      9,846     11,089     12,451     13,971     15,710     17,767     20,324     23,785     29,415
6.0%             7,343      8,315      9,364     10,514     11,798     13,267     15,003     17,162     20,085     24,839
6.5%             6,288      7,120      8,018      9,003     10,103     11,360     12,847     14,696     17,199     21,270
7.0%             5,449      6,170      6,948      7,801      8,754      9,844     11,132     12,735     14,903     18,431
7.5%             4,770      5,401      6,083      6,830      7,664      8,618      9,746     11,148     13,047     16,135
8.0%             4,213      4,771      5,373      6,032      6,769      7,612      8,608      9,847     11,524     14,252
8.5%             3,750      4,247      4,783      5,370      6,026      6,776      7,663      8,766     10,259     12,687
9.0%             3,362      3,807      4,287      4,814      5,402      6,074      6,869      7,858      9,196     11,373
9.5%             3,032      3,434      3,867      4,342      4,872      5,478      6,196      7,087      8,294     10,258
10.0%            2,750      3,114      3,507      3,938      4,419      4,969      5,619      6,428      7,523      9,303

Sample size as a function of sensitivity and statistical power; Negative Binomial
parameter α = 0.31, average and length of the test ? = 30, ? = 0.69
Design
• Define a Reasonable Number of KPIs
  • It's impossible to draw a conclusion based on 20 KPIs
  • Project your KPIs onto the main business (lead) indicators
  • Consider weighted KPIs or a GPI (General Performance Indicator)
• Seasonality
  • Weekends may have different user behavior than weekdays
  • Holidays can be unpredictable
  • 7-day rule of thumb (run for full weeks so every weekday is covered)
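One way to apply the 7-day rule of thumb is to round the planned duration up to whole weeks, so that weekdays and weekends are represented equally. This is a small sketch under that reading of the rule; the function name and the traffic figures are hypothetical.

```python
from math import ceil


def planned_duration_days(users_needed_total, eligible_users_per_day, min_weeks=1):
    """Round the duration implied by the sample size up to whole weeks."""
    raw_days = ceil(users_needed_total / eligible_users_per_day)
    weeks = max(min_weeks, ceil(raw_days / 7))
    return weeks * 7


if __name__ == "__main__":
    # 60,000 users at 5,000 eligible users per day is 12 days, rounded up to 14.
    print(planned_duration_days(60_000, 5_000))
```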
Analysis
• Statistical Parametric Methods
• Non-Parametric Methods
• Permutation Tests
• Outliers/Bots/Fraud
Analysis
• Statistical Parametric Methods
  • Use confidence intervals based on the KPI distribution
  • T-test, Chi-square test, etc. will work, but...
    • The t-test assumes a normal distribution of the statistic
    • Chi-square can be weak when low frequencies are observed
  • Try hypothesis testing based on the KPI distribution: it's not simple, but worth it
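For a binomial KPI, a confidence interval based on the KPI distribution can be sketched with the normal (Wald) approximation to the difference of two proportions. This is one standard choice, not necessarily the method used in the talk, and it is unreliable for very low counts; the function name and the example numbers are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (rate_B - rate_A) from conversions and group sizes."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    low, high = diff_confidence_interval(500, 10_000, 560, 10_000)
    # If the interval contains 0, the observed difference is not significant
    # at the chosen confidence level.
    print(f"difference in conversion rate: [{low:.4f}, {high:.4f}]")
```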
• The Negative Binomial (NB) distribution can be used as a generalization of Poisson in overdispersed cases (Var >> Mean).
• It has been used before in other domains to analyze count data (genetics, traffic modeling).
• It fits the real distribution well.
[Figure: frequency vs. number of searches; real data compared with fitted NB and fitted Poisson]
Analysis
• Non-parametric tests
  • When it's hard to estimate the distribution
  • As a sanity check (QA) for parametric tests
  • Mann-Whitney, Kolmogorov-Smirnov
• Pros:
  • Can be appropriate for unknown or non-Normal distributions
  • More robust than the t-test
• Cons:
  • Less sensitive and less powerful than parametric tests (median as a parameter)
  • Assume that both samples come from distributions of the same shape
  • Rely on a normal approximation of the test statistic in large samples
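To make the Mann-Whitney test concrete, here is a minimal stdlib sketch that ranks the pooled data (average ranks for ties) and uses the large-sample normal approximation for the p-value. It is a simplified illustration: it omits the tie and continuity corrections that statistical packages apply, so for real analyses a library implementation is the safer choice. The function name and sample values are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic for sample_a, two-sided p-value)."""
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p_value


if __name__ == "__main__":
    control = [3, 5, 1, 0, 2, 8, 4, 1, 0, 6]     # hypothetical searches per user
    variant = [4, 9, 2, 7, 5, 12, 6, 3, 1, 10]
    print(mann_whitney_u(control, variant))
```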
Analysis
• Permutation tests
1. Calculate the test statistic on the original groups
2. Pool the data, shuffle it, and split it into 2 random groups of the original sizes
3. Calculate the test statistic again
4. Compare it to your original statistic; if it is at least as extreme, increment k (k += 1)
5. Repeat steps 2-4 N times
6. The proportion of shuffles that gave a result at least as extreme as your
original, k/N, is your p-value
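The permutation steps above can be sketched as follows. The test statistic here is the absolute difference in means, which is my choice for illustration; any statistic (median, conversion rate) can be substituted. The function name, the default number of permutations, and the data are assumptions.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means; returns k/N."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))   # step 1
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    k = 0
    for _ in range(n_permutations):                 # step 5
        rng.shuffle(pooled)                         # step 2
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))  # step 3
        if stat >= observed:                        # step 4
            k += 1
    return k / n_permutations                       # step 6


if __name__ == "__main__":
    control = [3, 5, 1, 0, 2, 8, 4, 1, 0, 6]     # hypothetical searches per user
    variant = [4, 9, 2, 7, 5, 12, 6, 3, 1, 10]
    print("p-value:", permutation_test(control, variant, n_permutations=2_000))
```

Because no distribution is assumed, the same function works for skewed KPIs such as revenue, at the cost of more computation.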
Analysis
• Check for outliers
  • Plot your data at a daily/hourly level
  • Descriptive statistics can help (variance)
• Try to filter bots and crawlers
  • It is almost impossible to filter all non-human activity on the web.
  • Automated bots and crawlers can bias the results and lead to wrong conclusions.
• Run a continuous A/A test as a sanity check for the whole system
  • What difference do you observe between the A groups, and is it insignificant?
  • Technical and tracking issues
Data-Driven Culture
• Avoid a HiPPO (Highest Paid Person's Opinion) that is not supported by data
• Be clear about your KPIs and how they affect your business
• Fight your ego: numbers don't lie
• 80%-90% of tests won't give a positive result
• Learn from failed tests
Pitfalls
• Picking an easy-to-beat KPI unrelated to the lead business metrics
  • Example: focusing on increasing the click-through rate for banners/buttons while ignoring other metrics such as user retention or revenue.
• Using incorrect statistical methods or violating their assumptions
  • Example 1: assuming that a KPI has a Normal distribution without actually checking it.
  • Example 2: using online significance calculators without understanding the data distribution.
Pitfalls
• Combining ratios from different proportions over time: Simpson's Paradox
  • Example:
• Ignoring outliers and bots | not plotting data on a timeline
  • Example: one outlier can change the test results
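To illustrate the Simpson's Paradox pitfall listed above, the sketch below uses made-up numbers (not data from the talk) in which the traffic split between variants changes between two periods. Variant B converts better in each period, yet looks far worse once the periods are pooled.

```python
# Hypothetical (users, conversions) per variant and period.
data = {
    "weekdays": {"A": (100, 10), "B": (1000, 110)},
    "weekend": {"A": (1000, 300), "B": (100, 32)},
}


def rate(users, conversions):
    return conversions / users


def pooled_rate(variant):
    """Conversion rate after naively summing users and conversions over periods."""
    users = sum(period[variant][0] for period in data.values())
    conversions = sum(period[variant][1] for period in data.values())
    return conversions / users


if __name__ == "__main__":
    for name, period in data.items():
        print(name, {v: round(rate(*period[v]), 3) for v in ("A", "B")})
    print("pooled", {v: round(pooled_rate(v), 3) for v in ("A", "B")})
```

Keeping the traffic split constant for the whole test, or comparing variants within each period, avoids the reversal.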
Pitfalls
• Starting a test without validation (an A/A test is a solution)
• Changing the control group during the test (solution: change both groups!)
• Technical issues with the experiment group
  • Example: redirects, cache, new technology
• Running your experiment "until it reaches a significant difference"
• Not "anchoring" users to one group only (also cookie problems)
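The cost of running a test "until it reaches significance" can be seen in a small A/A simulation: both groups share the same true conversion rate, so every significant result is a false positive. The sketch compares checking at every interim look against checking once at the planned end. All parameters (conversion rate, group size, look frequency, the pooled z-test) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two groups of equal size n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def false_positive_rates(n_sims=300, users=1000, look_every=100, rate=0.1, seed=1):
    """Return (rate with peeking at every look, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        for n in range(1, users + 1):
            conv_a += rng.random() < rate  # same true rate in both groups
            conv_b += rng.random() < rate
            if n % look_every == 0 and is_significant(conv_a, conv_b, n):
                ever_significant = True    # a peeker would have stopped here
        peeking_hits += ever_significant
        fixed_hits += is_significant(conv_a, conv_b, users)
    return peeking_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"false positives with peeking: {peeking:.1%}, fixed horizon: {fixed:.1%}")
```

The fixed-horizon rate should stay near the nominal alpha, while the peeking rate is typically several times higher, which is why the duration has to be fixed before the test starts.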
Reference
• How Not To Run An A/B Test
  http://www.evanmiller.org/how-not-to-run-an-ab-test.html
• Microsoft Experimentation Platform
  http://www.exp-platform.com/Pages/ExPpitfalls.aspx
• Simpson's Paradox
  http://vudlab.com/simpsons/
