The document discusses optimization techniques for online experiments known as multi-armed bandits. It introduces the multi-armed bandit problem of balancing exploration of new options with exploitation of existing knowledge. It then describes several techniques for solving this problem, including AB testing, epsilon greedy, and upper confidence bound approaches. The document provides examples of how these techniques work and notes that multi-armed bandit methods can provide more efficient learning than AB testing alone.
1. Bandit Basics: A Different Take on Online Optimization
Conductrics twitter: @mgershoff
2. Who is this guy?
Matt Gershoff
CEO: Conductrics
Many years in Database Marketing (New York and Paris), and a bit of Web Analytics
www.conductrics.com
Twitter: @mgershoff
Email: matt@conductrics.com
4. What Are We Going to Hear?
Optimization Basics
Multi-Armed Bandit
It's a Problem, Not a Method
Some Methods
AB Testing
Epsilon Greedy
Upper Confidence Bound (UCB)
Some Results
6. OPTIMIZATION
If THIS Then THAT
ITTT brings together:
1. Decision Rules
2. Predictive Analytics
3. Choice Optimization
7. OPTIMIZATION
Find and Apply the Rule with the most Value
(Grid of candidate "If THIS Then THAT" rules.)
8. OPTIMIZATION
THIS = variables whose values are given to you (inputs): Facebook, High Spend, Urban GEO, ..., Home Page, App Use
THAT = variables whose values you control (outputs): Offer A, Offer B, Offer C, ..., Offer Y, Offer Z
A predictive model (features F1, F2, ..., Fm) scores each "If THIS Then THAT" pairing and estimates its value (Σ Value_i).
9. But...
1. We don't have data on THAT.
2. We need to collect a sample of THAT.
3. How do we sample efficiently across the options (Offer A? Offer B? Offer C? ... Offer Y? Offer Z?)?
11. What is a Multi-Armed Bandit?
One-Armed Bandit = Slot Machine
The problem: how do you pick between slot machines (A or B?) so that you walk out of the casino with the most $$$ at the end of the night?
12. Objective
Pick so as to get the most return/profit you can over time.
Technical term: Minimize Regret
13. Sequential Selection
...but how do you pick, A or B?
You need to sample, but do it efficiently.
14. Explore = Collect Data
Data collection is costly; it is an investment.
Be efficient: balance the potential value of collecting new data with exploiting what you currently know.
15. Multi-Armed Bandits
Bandit problems embody in essential
form a conflict evident in all human
action: choosing actions which yield
immediate reward vs. choosing actions
whose benefit will come only later.*
- Peter Whittle
*Source: Qing Zhao, UC Davis. Plenary talk at SPAWC, June, 2010.
16. Exploration vs. Exploitation
1) Explore/Learn: try out different actions to learn how they perform over time. This is a data collection task.
2) Exploit/Earn: take advantage of what you have learned to get the highest payoff. This is your current best guess.
17. Not A New Problem
1933: first work on competing options
1940: a WWII problem the Allies attempt to tackle
1953: Bellman formulates it as a Dynamic Programming problem
Source: http://www.lancs.ac.uk/~winterh/GRhist.html
18. Testing
Explore First
All actions have an equal chance of
selection (uniform random).
Use hypothesis testing to select a
Winner.
Then Exploit: keep only the Winner for selection.
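As a rough sketch of this explore-first, then-exploit pattern (an illustration only, not code from the deck; the action names, traffic counts, and conversion rates below are made up), the flow might look like this in Python:

import random

# Hypothetical actions and conversion rates, for illustration only.
actions = ["A", "B"]
true_rates = {"A": 0.10, "B": 0.12}
counts = {a: 0 for a in actions}
rewards = {a: 0.0 for a in actions}

def observe_reward(action):
    # Stand-in for watching a real user convert (1) or not (0).
    return 1.0 if random.random() < true_rates[action] else 0.0

# Explore first: every action has an equal chance of selection.
for _ in range(10000):
    a = random.choice(actions)
    counts[a] += 1
    rewards[a] += observe_reward(a)

# (A real A/B test would run a hypothesis test here before declaring a winner.)
winner = max(actions, key=lambda a: rewards[a] / counts[a])

# Then exploit: keep only the winner for selection.
for _ in range(10000):
    observe_reward(winner)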
20. P-Values: A Digression
A p-value is NOT the probability that the Null is true, P(Null = True | DATA).
It IS P(DATA (or more extreme) | Null = True).
Not a great tool for deciding when to stop sampling.
See:
http://andrewgelman.com/2010/09/noooooooooooooo_1/
http://www.stat.duke.edu/~berger/papers/02-01.pdf
21. A Couple Other Methods
1. Epsilon Greedy
Nice and Simple
2. Upper Confidence Bound (UCB)
Adapts to Uncertainty
23. Greedy
What do you mean by Greedy?
Make whatever choice seems
best at the moment.
24. Epsilon Greedy
What do you mean by Epsilon
Greedy?
Explore: randomly select an action ε percent of the time (say 20%).
Exploit: play greedy (pick the current best) 1 - ε of the time (say 80%).
25. Epsilon Greedy
For each user:
Explore/Learn (20%): select randomly (like A/B testing).
Exploit/Earn (80%): select the current best (be greedy).
26. Epsilon Greedy
20% of the time: select at random. 80% of the time: select the current best.
Action Value
A $5.00
B $4.00
C $3.00
D $2.00
E $1.00
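A minimal epsilon-greedy selection sketch (illustrative Python, not Conductrics' implementation; the estimated values mirror the table above):

import random

epsilon = 0.20  # explore 20% of the time
estimated_value = {"A": 5.00, "B": 4.00, "C": 3.00, "D": 2.00, "E": 1.00}

def choose_action():
    if random.random() < epsilon:
        # Explore: pick uniformly at random.
        return random.choice(list(estimated_value))
    # Exploit: pick the action with the highest current estimate.
    return max(estimated_value, key=estimated_value.get)

With these numbers, A is served about 84% of the time (80% greedy plus a 1-in-5 share of the 20% random traffic), while the other actions keep receiving enough traffic for their estimates to be updated.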
27. Continuous Sampling
(Chart: Explore/Learn and Exploit/Earn continue side by side over time.)
28. Epsilon Greedy
Super simple / low cost to implement
Tends to be surprisingly effective
Less affected by seasonality
Not optimal (hard to pick the best ε)
Doesn't use a measure of variance
Should we decrease exploration over time, and how? (See the sketch below.)
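On that last point, one common answer (an assumption here, not something the deck prescribes) is to anneal epsilon so the policy explores heavily early and only rarely later, for example:

def decayed_epsilon(t, k=100):
    # t = number of visitors seen so far; k is a tuning constant (assumed).
    # Exploration rate starts at 1.0 and shrinks roughly like k / t.
    return min(1.0, k / max(t, 1))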
29. Upper Confidence Bound
Basic Idea:
1) Calculate both the mean and a measure of uncertainty (variance) for each action.
2) Make greedy selections based on mean + uncertainty bonus.
31. Upper Confidence
Score each option using the upper portion of the confidence interval as a bonus:
Score = Mean + Bonus
32. Upper Confidence Bound
1) Use the upper portion of the CI as the Bonus.
2) Make greedy selections.
(Chart: estimated reward intervals for actions A, B, and C on a $0-$10 scale; A has the highest upper bound, so Select A.)
33. Upper Confidence Bound
1) Selecting action A reduces its uncertainty bonus (because there is more data).
2) Action C now has the highest score.
(Chart: A's interval has narrowed; C now has the highest upper bound, so Select C.)
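The deck does not spell out the exact bonus formula, but a standard UCB1-style sketch of the selection step (illustrative Python; the sqrt(2 * ln N / n_a) bonus is the textbook choice, assumed here) looks like:

import math

def ucb_choose(counts, rewards, total_plays):
    # counts[a]  = how many times action a has been selected
    # rewards[a] = total reward observed for action a
    best_action, best_score = None, float("-inf")
    for a in counts:
        if counts[a] == 0:
            return a  # try each action once before scoring
        mean = rewards[a] / counts[a]
        bonus = math.sqrt(2.0 * math.log(total_plays) / counts[a])
        score = mean + bonus  # mean + uncertainty bonus
        if score > best_score:
            best_action, best_score = a, score
    return best_action  # greedy on the optimistic score

Actions that have been played less keep a larger bonus, so they still win occasionally; as an action accumulates data its bonus shrinks, which is the behavior pictured in the two charts above.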
34. Upper Confidence Bound
Like an A/B test: uses a variance measure.
Unlike an A/B test: no hypothesis test.
Automatically balances exploration with exploitation.