The document discusses building decision trees for classification using Scala and Hadoop. It demonstrates the process of learning a decision tree over multiple MapReduce steps to split a sample data set into groups based on attributes like color and height. Key aspects covered include using the Scalding and Algebird frameworks to implement a generic decision tree learner that can handle both binary and multi-class classification problems over large datasets in a distributed manner.
12. Do you like cookies?
[Diagram: decision tree with splits "wears = stripes" / "wears != stripes" and "color = blue" / "color != blue", and leaves "yuck", "ok", and "cookie!"]
13. [Diagram: decision tree with splits "wears = stripes" / "wears != stripes" and "color = blue" / "color != blue"; all three leaves are labeled T]
14. [Diagram: decision tree with splits "wears = stripes" / "wears != stripes" and "color = blue" / "color != blue"; all three leaves are labeled T]
Do you like cookies?
How many cookies will you eat?
What's your favorite kind of cookie?
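The three questions above differ only in the type of the answer, which is what the T on the leaves stands for. As a rough sketch of that idea, the tree below is generic in its leaf type. The names `Node`, `Leaf`, `Branch`, and `Trees` are hypothetical, the layout of the tree (split on "wears" first, then on "color") is an assumption read off the diagram, and the leaf values are invented for illustration.

```scala
// Sketch only: hypothetical names, not the code from the talk.
sealed trait Node[T]
final case class Leaf[T](target: T) extends Node[T]
// Binary split: rows whose `feature` equals `value` follow `ifEqual`,
// all other rows (including rows missing the feature) follow `ifNotEqual`.
final case class Branch[T](feature: String, value: String,
                           ifNotEqual: Node[T], ifEqual: Node[T]) extends Node[T]

object Trees {
  // Walk from the root to a leaf and return that leaf's target.
  @annotation.tailrec
  def predict[T](node: Node[T], row: Map[String, String]): T = node match {
    case Leaf(t) => t
    case Branch(f, v, no, yes) =>
      predict(if (row.get(f).contains(v)) yes else no, row)
  }

  // Assumed layout of the slide's tree; only the leaf values vary with T.
  def cookieTree[T](yuck: T, ok: T, cookie: T): Node[T] =
    Branch("wears", "stripes",
      Branch("color", "blue", Leaf(yuck), Leaf(ok)),
      Leaf(cookie))
}

object CookieExamples {
  // Invented leaf values, one tree per question.
  val likesCookies = Trees.cookieTree(false, true, true)                // binary
  val howMany      = Trees.cookieTree(0, 2, 12)                         // numeric
  val favorite     = Trees.cookieTree("none", "oatmeal", "chocolate")   // multi-class
}
```

The same `predict` works for all three trees because nothing in it depends on what T is.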
15. Bootstrap or k-fold?
Chi-square or entropy?
Wow!
Classification or regression?
Binary splits or multiway?
Out-of-bag or out-of-time?
One tree or many?
Binary or multi-class?
16. trait Evaluator[V,T]
trait Tree[V,T]
trait Splitter[V,T]
trait Error[T,E]
Wow!
Such types!
case class Instance[V,T]
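The slide gives only the names and type parameters of these types. To make their roles concrete, here is one way they might be filled in. Every field and method below is an assumption made for illustration, not the signatures used in the talk, and `MisclassificationCount` is a hypothetical example instance.

```scala
// Sketch only: the slide shows type names, not members.
// All fields and methods here are assumed for illustration.
object Types {
  // One training example: feature values of type V keyed by name, plus a target of type T.
  final case class Instance[V, T](features: Map[String, V], target: T)

  // A learned tree maps feature values to a predicted target, if it has one.
  trait Tree[V, T] {
    def predict(features: Map[String, V]): Option[T]
  }

  // Summarises (value, target) pairs into an S that can be merged,
  // then proposes candidate tests on V from the merged summary.
  trait Splitter[V, T] {
    type S
    def create(value: V, target: T): S
    def merge(a: S, b: S): S
    def candidates(summary: S): Seq[V => Boolean]
  }

  // Scores the groups of targets a candidate split would produce.
  trait Evaluator[V, T] {
    def score(groups: Seq[Seq[T]]): Double
  }

  // Accumulates an error measure E from (actual, predicted) target pairs.
  trait Error[T, E] {
    def create(actual: T, predicted: T): E
    def merge(a: E, b: E): E
  }

  // Example instance: misclassification counts as (wrong, total).
  object MisclassificationCount extends Error[Boolean, (Long, Long)] {
    def create(actual: Boolean, predicted: Boolean): (Long, Long) =
      (if (actual == predicted) 0L else 1L, 1L)
    def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
      (a._1 + b._1, a._2 + b._2)
  }
}
```

Keeping V, T, and E as type parameters is what lets one learner serve different targets and error measures.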
34. Step 2/21
[Diagram: Instance[V,T] records and S values, with a Map Reduce step]
Other options:
CountMinSketch
QTree
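One way to read this step is that each instance is turned into small S summaries, which are then added together per feature. A minimal sketch of that reading, run in memory rather than on Hadoop, is below. It assumes S is a table of target counts per feature value. The `Semigroup` trait here is a local stand-in for a semigroup such as Algebird's, so the example has no dependencies. `Counts`, `countsSemigroup`, `summarize`, and the fields of `Instance` are all hypothetical.

```scala
// Sketch only: an in-memory imitation of the map and reduce phases.
object Summaries {
  // Local stand-in for a semigroup: a single associative `plus`.
  trait Semigroup[S] { def plus(a: S, b: S): S }

  // Assumed summary type S: counts of each target value per feature value.
  type Counts[V, T] = Map[V, Map[T, Long]]

  // Add two count tables key by key.
  def countsSemigroup[V, T]: Semigroup[Counts[V, T]] = new Semigroup[Counts[V, T]] {
    def plus(a: Counts[V, T], b: Counts[V, T]): Counts[V, T] =
      b.foldLeft(a) { case (acc, (v, targets)) =>
        val merged = targets.foldLeft(acc.getOrElse(v, Map.empty[T, Long])) {
          case (m, (t, n)) => m.updated(t, m.getOrElse(t, 0L) + n)
        }
        acc.updated(v, merged)
      }
  }

  // Assumed fields; the slides show only the name Instance[V,T].
  final case class Instance[V, T](features: Map[String, V], target: T)

  // "Map": emit one (featureName, summary) pair per feature of each instance.
  // "Reduce": group by feature name and add the summaries with the semigroup.
  def summarize[V, T](instances: Seq[Instance[V, T]])(
      sg: Semigroup[Counts[V, T]]): Map[String, Counts[V, T]] =
    instances
      .flatMap(i => i.features.map { case (name, v) => name -> Map(v -> Map(i.target -> 1L)) })
      .groupBy(_._1)
      .map { case (name, pairs) => name -> pairs.map(_._2).reduce((a, b) => sg.plus(a, b)) }
}
```

Because `plus` is associative, partial sums can be formed in any grouping before the final merge, which is the property a distributed reduce relies on. Any summary type with such a `plus` could take the place of the exact counts used here.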
38. Step 3/21
[Diagram: Instance[V,T] records, S values, and Split[V,T] values]
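This step appears to go from merged S values to candidate Split[V,T] values. The sketch below shows one way that could work, assuming S is a table of target counts per feature value and that candidates are scored by entropy, one of the options raised earlier in the deck. The fields of `Split` and the functions `entropy`, `weightedEntropy`, `candidates`, and `best` are hypothetical, and the talk may score splits differently.

```scala
// Sketch only: turning per-feature count summaries into a chosen split.
object Splits {
  // Assumed summary type S: counts of each target value per feature value.
  type Counts[V, T] = Map[V, Map[T, Long]]

  // Hypothetical shape for Split[V,T]: an equality test on one feature,
  // plus the target counts that fall on each side of the test.
  final case class Split[V, T](feature: String, value: V,
                               ifEqual: Map[T, Long], ifNotEqual: Map[T, Long])

  // Shannon entropy, in bits, of a target distribution given as counts.
  def entropy[T](counts: Map[T, Long]): Double = {
    val total = counts.values.sum.toDouble
    counts.values.filter(_ > 0).map { n =>
      val p = n / total
      p * (math.log(1.0 / p) / math.log(2))
    }.sum
  }

  // Add two target-count maps key by key.
  def add[T](a: Map[T, Long], b: Map[T, Long]): Map[T, Long] =
    b.foldLeft(a) { case (m, (t, n)) => m.updated(t, m.getOrElse(t, 0L) + n) }

  // Count-weighted average entropy of the two sides; lower is better.
  def weightedEntropy[V, T](s: Split[V, T]): Double = {
    val l = s.ifEqual.values.sum.toDouble
    val r = s.ifNotEqual.values.sum.toDouble
    if (l + r == 0) 0.0
    else (l * entropy(s.ifEqual) + r * entropy(s.ifNotEqual)) / (l + r)
  }

  // One candidate per observed value: "feature = v" against everything else.
  def candidates[V, T](feature: String, counts: Counts[V, T]): Seq[Split[V, T]] =
    counts.keys.toSeq.map { v =>
      val rest = (counts - v).values.foldLeft(Map.empty[T, Long])((acc, m) => add(acc, m))
      Split(feature, v, counts(v), rest)
    }

  // The candidate with the lowest weighted entropy across all features, if any.
  def best[V, T](summaries: Map[String, Counts[V, T]]): Option[Split[V, T]] = {
    val all = summaries.toSeq.flatMap { case (f, c) => candidates(f, c) }
    if (all.isEmpty) None else Some(all.minBy(s => weightedEntropy(s)))
  }
}
```

Only the summaries are needed to pick a split here, so the raw instances do not have to be revisited for this part of the step.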
40. Step 3/21
[Diagram: Instance[V,T] records, S values, and Split[V,T] values; this build of the slide adds a further group of S values]
41. Step 3/21
[Diagram: Instance[V,T] records, S values, and Split[V,T] values; this build of the slide adds a second group of Split[V,T] values]