This document discusses best practices for designing, analyzing, and avoiding pitfalls in A/B testing. It recommends determining sample size and test duration based on key performance indicator distributions and seasonality. Both parametric and non-parametric statistical methods can be used to analyze results, and common pitfalls include picking non-representative metrics, violating statistical assumptions, and technical issues impacting experiment groups.
A/B Testing - Design, Analysis and Pitfalls
1. A/B Testing - Design, Analysis and Pitfalls
slavabo@gmail.com
5. Agenda
• Design the Experiment
  • 2 main questions: how many users and how long to run the test
  • Define a reasonable number of KPIs
  • Pay attention to seasonality/weekday effects
• Analyze the Experiment
  • Statistical methods for checking significance
  • Non-parametric methods
  • Outliers/bots/fraud
• Data-driven culture
• Pitfalls
• Open Discussion
6. Design
• Test Duration & Sample Size
  • Duration needs to be defined before the experiment starts!
  • Depends on the distribution of the main KPIs
    • About 80% have a binomial distribution (conversion rate, CTR, etc.), and the CLT can help.
    • About 20% are other types (event counts, revenue).
  • Use power calculations to define N (sample size) and t (duration), OR use rules of thumb.
  • General rule: the smaller the difference you want to detect, the more data you'll need to collect.
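As a rough illustration of the power calculation for a binomial KPI, the sketch below uses the standard normal-approximation formula for a two-sided two-proportion test. The function name, the 10% baseline rate, and the default alpha and power values are assumptions chosen for the example, not figures from the slides.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, min_detectable_diff, alpha=0.05, power=0.8):
    """Approximate users needed per group (normal approximation, two-sided test)."""
    p_new = p_base + min_detectable_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance_sum / min_detectable_diff ** 2
    return math.ceil(n)


# Hypothetical baseline conversion rate of 10%.
n_small_diff = sample_size_per_group(0.10, 0.01)  # detect a 1-point lift
n_large_diff = sample_size_per_group(0.10, 0.02)  # detect a 2-point lift
print(n_small_diff, n_large_diff)
```

Halving the difference to detect roughly quadruples the required sample, which is the general rule from the slide in numbers.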
7. Design
• Example: number of searches per user (SweetIM)
  • Poisson assumption for count events
  • Not appropriate when variance >> mean
  • NB (negative binomial) was found appropriate
  • Power limitation of NB
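One quick way to check the Poisson assumption is the variance-to-mean ratio, which is close to 1 for Poisson data. The sketch below is a minimal illustration; the `searches_per_user` values are made up for the example and are not SweetIM data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson, much larger when overdispersed."""
    return variance(counts) / mean(counts)


# Hypothetical counts: most users search rarely, a few search a lot.
searches_per_user = [0, 1, 0, 2, 1, 0, 35, 3, 0, 1, 120, 2, 0, 4, 1, 0]
index = dispersion_index(searches_per_user)
print(f"dispersion index: {index:.1f}")
```

A ratio far above 1 suggests that a Poisson model will understate the variance and that an overdispersed model such as NB is worth trying.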
9. Design
• Define a Reasonable Number of KPIs
  • It's impossible to draw a conclusion from 20 KPIs
  • Map your KPIs to the main business (lead) indicators
  • Consider weighted KPIs or a GPI (General Performance Indicator)
• Seasonality
  • Weekends may show different user behavior than weekdays
  • Holidays can be unpredictable
  • 7-day rule of thumb
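One common reading of the 7-day rule of thumb is to run the test for whole weeks, so that every weekday is equally represented in both groups. The sketch below assumes that reading; the function name and traffic numbers are hypothetical.

```python
import math


def planned_duration_days(users_needed_per_group, daily_users_per_group):
    """Days needed to reach the sample size, rounded up to full weeks."""
    days = math.ceil(users_needed_per_group / daily_users_per_group)
    weeks = max(1, math.ceil(days / 7))  # never run for less than one full week
    return weeks * 7


# Hypothetical numbers: 15,000 users needed per group, 1,200 new users per group per day.
duration = planned_duration_days(15000, 1200)
print(duration)
```

Here 13 days of traffic would be enough for the sample size, but the plan rounds up to 14 days to cover two complete weekly cycles.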
11. Analysis
• Statistical Parametric Methods
  • Use confidence intervals based on the KPI distribution
  • T-test, chi-square test, etc. will work, but...
    • The t-test assumes the test statistic is normally distributed
    • The chi-square test can be weak when low frequencies are observed
  • Try hypothesis testing based on the KPI distribution: it's not simple, but worth it
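For a binomial KPI such as conversion rate, a confidence interval and significance test can be sketched with the normal approximation. This is a minimal example under that assumption; the function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (difference, two-sided p-value, confidence interval) for B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error is used for the test of "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error is used for the interval around the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)


# Hypothetical results: 500/10,000 conversions in A, 580/10,000 in B.
diff, p_value, ci = two_proportion_test(500, 10000, 580, 10000)
print(f"diff={diff:.4f}, p={p_value:.4f}, CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The approximation relies on the CLT, so it is less trustworthy for very small samples or very low conversion counts.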
12.
• NB can be used as a generalization of Poisson in overdispersed cases (Var >> Mean).
• It has been used before in other domains to analyze count data (genetics, traffic modeling).
• It fits the real distribution well.
[Figure: frequency vs. number of searches (0 to 500). Series: real data, fitted NB, fitted Poisson.]
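The comparison between a fitted NB and a fitted Poisson can be sketched with a method-of-moments fit. This is an illustration, not necessarily the fitting method used for the figure; the data are hypothetical, and the parametrization (mean = r(1-p)/p) is one of several in use.

```python
import math
from statistics import mean, variance


def fit_negative_binomial(counts):
    """Method-of-moments estimates (r, p) with mean = r(1-p)/p, variance = r(1-p)/p^2."""
    m, v = mean(counts), variance(counts)
    if v <= m:
        raise ValueError("variance <= mean: no overdispersion, Poisson may suffice")
    return m * m / (v - m), m / v


def nb_pmf(k, r, p):
    """P(X = k) for the negative binomial, computed in log space for stability."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


# Hypothetical searches-per-user counts with a heavy right tail.
searches = [0, 1, 0, 2, 1, 0, 35, 3, 0, 1, 120, 2, 0, 4, 1, 0]
r, p = fit_negative_binomial(searches)
share_zero = searches.count(0) / len(searches)
print(f"observed P(0)={share_zero:.3f}, NB={nb_pmf(0, r, p):.3f}, "
      f"Poisson={poisson_pmf(0, mean(searches)):.6f}")
```

With overdispersed data the Poisson fit puts almost no mass at zero, while the NB fit can accommodate both the many zeros and the long tail.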
13. Analysis
• Non-parametric tests
  • When it's hard to estimate the distribution
  • As a QA check for parametric tests
  • Mann-Whitney, Kolmogorov-Smirnov
• Pros:
  • Can be appropriate for unknown or non-normal distributions
  • More robust than the t-test
• Cons:
  • Less sensitive and less powerful than parametric tests when the parametric assumptions hold (median as a parameter)
  • Assume that both samples come from the same distribution (under the null hypothesis)
  • Rely on a normal approximation of the test statistic in large samples
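To make the Mann-Whitney test concrete, here is a minimal standard-library sketch using average ranks for ties and the large-sample normal approximation (without continuity correction). The function name and sample values are hypothetical; in practice a statistics package would normally be used.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic for sample_a, two-sided p-value via normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] is a run of tied values
        avg_rank = (i + 1 + j) / 2      # average of ranks i+1 .. j
        tied = j - i
        tie_term += tied ** 3 - tied
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                      # all values identical
        return u_a, 1.0
    z = (u_a - mean_u) / math.sqrt(var_u)
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical per-user metric values for two groups.
u_stat, p_value = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u_stat, round(p_value, 4))
```

For samples this small an exact p-value would be preferable; the normal approximation is shown because it is the one that scales to A/B test sizes.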
14. Analysis
• Permutation tests
1. Calculate the test statistic on the original groups.
2. Shuffle the pooled data and resample 2 random groups.
3. Calculate the test statistic again.
4. Compare it to your original statistic; if it is more extreme, increment k (k = k + 1).
5. Return to step 2 and repeat N times.
6. The proportion k/N estimates the probability of getting a result more extreme than your original one: this is your p-value.
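The steps above can be sketched as follows. The test statistic (absolute difference in means), the function name, and the data are assumptions for the example; "more extreme" is implemented as "at least as extreme" (>=), a common convention.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling labels."""
    rng = random.Random(seed)

    def statistic(a, b):
        return abs(sum(b) / len(b) - sum(a) / len(a))

    observed = statistic(group_a, group_b)          # step 1
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    k = 0
    for _ in range(n_permutations):                 # step 5: repeat N times
        rng.shuffle(pooled)                         # step 2: reshuffle into 2 groups
        permuted = statistic(pooled[:n_a], pooled[n_a:])  # step 3
        if permuted >= observed:                    # step 4
            k += 1
    return k / n_permutations                       # step 6: p-value estimate


# Hypothetical per-user values for control and treatment.
control = [10, 11, 12, 9, 10, 11, 10, 12]
treatment = [15, 16, 14, 17, 15, 16, 15, 14]
p_value = permutation_test(control, treatment, n_permutations=2000)
print(p_value)
```

The resolution of the estimate is 1/N, so a larger N is needed to report very small p-values.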
15. Analysis
• Check for outliers
  • Plot your data at the daily/hourly level
  • Descriptive statistics can help (variance)
• Try to filter bots and crawlers
  • It is almost impossible to filter all non-human activity on the web.
  • Automated bots and crawlers can bias the results and lead to wrong conclusions.
• Continuous A/A test as a sanity check for the whole system
  • What difference do you observe between the A groups, and is it insignificant?
• Technical and tracking issues
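The A/A sanity check can be illustrated with a simulation: when both groups receive the same experience, about alpha of the comparisons should still come out "significant" by chance. The sketch below assumes a binomial KPI and a two-proportion z-test; all names and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(true_rate=0.1, users_per_group=1000,
                           n_tests=400, alpha=0.05, seed=1):
    """Share of simulated A/A tests that are flagged as significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(users_per_group))
        conv_b = sum(rng.random() < true_rate for _ in range(users_per_group))
        if z_test_p_value(conv_a, users_per_group, conv_b, users_per_group) < alpha:
            false_positives += 1
    return false_positives / n_tests


rate = aa_false_positive_rate()
print(f"A/A tests flagged significant: {rate:.1%}")
```

If a real system flags A/A comparisons much more often than alpha, that points to a problem in assignment, tracking, or the statistical method rather than to a real effect.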
16. Data-Driven Culture
• Avoid a HiPPO (Highest Paid Person's Opinion) that is not supported by data
• Be clear about your KPIs and how they affect your business
• Fight your ego: numbers don't lie
• 80%-90% of tests won't give a positive result
• Learn from failed tests
17. Pitfalls
• Picking an easy-to-beat KPI unrelated to lead business metrics
  • Example: focusing on increasing click-through rate for banners/buttons while ignoring other metrics like user retention or revenue.
• Using incorrect statistical methods or violating their assumptions
  • Example 1: assuming that a KPI has a normal distribution without actually checking it.
  • Example 2: using online significance calculators without understanding the data distribution.
18. Pitfalls
• Combining ratios from different proportions over time (Simpson's paradox)
  • Example:
• Ignoring outliers and bots; not plotting data on a timeline
  • Example: one outlier can change the test results
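Simpson's paradox, mentioned above, can be illustrated with made-up numbers: the traffic split between groups changes between two periods, variant B wins within each period, yet A wins when the periods are naively combined. All figures below are hypothetical.

```python
# (conversions, users) per group and period; the traffic split changes over time.
periods = {
    "period_1": {"A": (80, 1000), "B": (9, 100)},    # A: 8.0%, B: 9.0%
    "period_2": {"A": (2, 100), "B": (30, 1000)},    # A: 2.0%, B: 3.0%
}


def rate(conversions, users):
    """Conversion rate as a fraction."""
    return conversions / users


# Winner within each period.
per_period_winner = {
    name: max(groups, key=lambda g: rate(*groups[g]))
    for name, groups in periods.items()
}

# Winner after summing conversions and users across periods.
totals = {
    g: (sum(periods[p][g][0] for p in periods), sum(periods[p][g][1] for p in periods))
    for g in ("A", "B")
}
overall_winner = max(totals, key=lambda g: rate(*totals[g]))

print(per_period_winner, overall_winner)
```

Keeping the allocation ratio fixed for the whole test, or analyzing each period separately, avoids this trap.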
19. Pitfalls
• Starting a test without validation (an A/A test is a solution)
• Changing the control group during the test (solution: change both groups!)
• Technical issues with the experiment group
  • Example: redirects, cache, new technology
• Running your experiment "until it reaches a significant difference"
• Not "anchoring" users to one group only (also cookie problems)
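One way to anchor users to a single group is to derive the assignment deterministically from a stable user identifier instead of a random draw stored in a cookie. The sketch below assumes such an identifier exists; the function name, experiment name, and ID format are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment_name, groups=("control", "treatment")):
    """Map a user to the same group on every call by hashing a stable ID."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(groups)  # stable across sessions and machines
    return groups[bucket]


# The same user always lands in the same group for a given experiment.
first_visit = assign_group("user-42", "new-search-box")
second_visit = assign_group("user-42", "new-search-box")

# Over many users the split is close to even.
assignments = [assign_group(f"user-{i}", "new-search-box") for i in range(10000)]
treatment_share = assignments.count("treatment") / len(assignments)
print(first_visit, second_visit, treatment_share)
```

Including the experiment name in the hash key keeps assignments independent across experiments, so the same users do not always end up in treatment together.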
21. References
• How Not To Run An A/B Test
  http://www.evanmiller.org/how-not-to-run-an-ab-test.html
• Microsoft Experimentation Platform
  http://www.exp-platform.com/Pages/ExPpitfalls.aspx
• Simpson's Paradox
  http://vudlab.com/simpsons/