
Joel Grus

Chief Scientist, VoloMetrix
@joelgrus

About Me
• Chief Scientist at VoloMetrix
• Have a 2-year-old daughter
• Did not take me long to discover that “boys” clothing is fun, “girls”
clothing kind of sucks

Typical
“Toddler Boys”
Shirt

Typical
“Toddler Girls”
Shirt

Obvious to us, but can a computer figure it
out?

The Data
• Downloaded image of every “toddler boys” and “toddler girls” t-shirt from
  • Carters
  • Children’s Place
  • Crazy 8
  • Gap Kids
  • Gymboree
  • Old Navy
  • Target

• 616 images of boys’ shirts and 446 images of girls’ shirts
• The goal: to build a model that predicts “boy shirt” or “girl shirt” just based
on the images!

Attempt #1: Colors
• Each image is a collection of RGB pixels
• There are 256 * 256 * 256 ~ 17 million possible colors (too many)
• Bucket each of R, G, B into [0,85), [85,170), or [170,256)
• This gives 3 * 3 * 3 = 27 possible colors
• Use features “does image contain at least one pixel of color j?”
• Train logistic regression model on 80% of shirts, test on other 20%
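The feature construction above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the talk: the names `bucket`, `color_features`, and `train_test_split` are hypothetical, an image is assumed to be a list of (R, G, B) tuples with values 0-255, and the three-pixel "image" is made up. The resulting 27-element vectors are what a logistic regression model would be trained on.

```python
import random
from itertools import product

def bucket(value):
    """Map a 0-255 channel value to bucket 0, 1, or 2 (ranges of width 85)."""
    return min(value // 85, 2)  # 255 // 85 == 3, so clamp it into the top bucket

# All 3 * 3 * 3 = 27 coarse colors, in a fixed order so feature j always
# refers to the same color.
COARSE_COLORS = list(product(range(3), repeat=3))

def color_features(pixels):
    """Return 27 binary features: 1 if at least one pixel has coarse color j."""
    seen = {(bucket(r), bucket(g), bucket(b)) for r, g, b in pixels}
    return [1 if color in seen else 0 for color in COARSE_COLORS]

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and cut it into train and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# A made-up 3-pixel "image": pure red, white, and dark gray.
toy_pixels = [(255, 0, 0), (255, 255, 255), (40, 40, 40)]
features = color_features(toy_pixels)

# Split ten placeholder shirt ids 80% / 20%.
train, test = train_test_split(range(10))
```

Clamping in `bucket` keeps the value 255 inside the top bucket, which a plain integer division by 85 would otherwise push into a fourth one.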

Color Model Performance

P(girl shirt | “girl shirt”) = 75%
P(boy shirt | “boy shirt”) = 77%
P(“girl shirt” | girl shirt) = 63%
P(“boy shirt” | boy shirt) = 86%
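In the notation above, quoted labels are the model's predictions and unquoted labels are the true categories, so the first two lines are precision and the last two are recall. A rough sketch of how such numbers are computed from test-set predictions follows; the function name `prediction_rates` and the eight toy labels are made up for illustration and are not the talk's data.

```python
def prediction_rates(actual, predicted, label):
    """Return (P(actual = label | predicted = label),
               P(predicted = label | actual = label))."""
    both = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    n_predicted = sum(1 for p in predicted if p == label)
    n_actual = sum(1 for a in actual if a == label)
    precision = both / n_predicted if n_predicted else 0.0
    recall = both / n_actual if n_actual else 0.0
    return precision, recall

# Made-up true categories and model predictions for eight test shirts.
actual    = ["boy", "boy", "boy",  "girl", "girl", "girl", "boy", "girl"]
predicted = ["boy", "boy", "girl", "girl", "boy",  "girl", "boy", "boy"]

boy_precision, boy_recall = prediction_rates(actual, predicted, "boy")
girl_precision, girl_recall = prediction_rates(actual, predicted, "girl")
```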

[Figure: histogram of # of shirts (boys, girls) by “Confidence Score” ( > 0 “boy shirt”, < 0 “girl shirt”)]

[Figure labels: “girlier”, less colorful, “boyier”, more colorful]

Attempt #2: Eigenshirts
• To compare images, rescale all of them to 138 x 138
• Chose this size because many were 138 x 138 already
• Others mostly bigger

• Using R, G, B as coordinates for each pixel, think of each image as a
point in 138 * 138 * 3 = 57,132-dimensional space
• Obviously, with 57k features and only about 1,000 shirts (616 + 446 = 1,062), a model fit directly on the pixels will overfit
• Use dimensionality reduction to find the 10 most “interesting”
dimensions, project shirts into 10-d subspace, build model there
• Each subspace dimension determines a (Platonic ideal) “eigenshirt”
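The slides only say "dimensionality reduction"; the sketch below assumes principal component analysis (the technique the name "eigenshirt" suggests) and finds the components with power iteration so that it needs nothing beyond the standard library. All function names are hypothetical, and the six 3-d points are toy data standing in for 57,132-d pixel vectors. In practice a library routine would do this step.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_component(rows, iterations=200, seed=0):
    """Power iteration: unit direction of greatest variance in centered rows."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in rows[0]]
    for _ in range(iterations):
        # Apply X^T X to w without forming the full covariance matrix.
        scores = [dot(row, w) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(len(w))]
        norm = math.sqrt(dot(w, w))
        w = [x / norm for x in w]
    return w

def principal_components(rows, k):
    """Find k components, removing each one from the data once it is found."""
    components = []
    for _ in range(k):
        w = top_component(rows)
        components.append(w)
        rows = [[x - dot(row, w) * wj for x, wj in zip(row, w)] for row in rows]
    return components

# Toy data: six points lying close to the line x = y in 3-d space.
toy = [[1, 1, 0.0], [2, 2, 0.1], [3, 3, -0.1],
       [4, 4, 0.1], [5, 5, 0.0], [6, 6, -0.1]]
centered = center(toy)
components = principal_components(centered, k=2)

# Each point is now described by 2 coordinates instead of 3.
projected = [[dot(row, c) for c in components] for row in centered]
```

Under this reading, each component is itself a vector of the same length as a flattened image, so reshaping it back to 138 x 138 x 3 would give a picture: the "eigenshirt". The model is then built on the projected coordinates rather than on raw pixels.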

T shirts, feminism, parenting, and data science

What does projection look like?

Almost all miscategorized shirts have
weak predictions (overall 93% accuracy)

Future Directions
• Look at text on shirt (but too lazy to transcribe it)
• Try to make images same size / background color
• Build model to predict how “fun” a shirt is (but will require tedious
hand-labeling)
• ??

More info
• Code (but not data) is on https://github.com/joelgrus/shirts
• Two blog posts on joelgrus.com, both linked from the github README
(or Google them, they have the same title as this talk)
• Follow me on twitter: @joelgrus
