LEARNING FROM SETS
ANDREW CLEGG
IN A NUTSHELL
ABOUT ME
• Yelp (starting next week!)
• Etsy, Pearson, Last.fm, AstraZeneca, consulting
• Bioinformatics, information retrieval, natural language processing (UCL/Birkbeck)
• Main interests: search, recommendations, personalization
• @andrew_clegg
• http://andrewclegg.org/
LEARNING DEEP REPRESENTATIONS FOR UNORDERED ITEM SETS
LEARNING FROM ITEM COLLECTIONS
PROBLEM STATEMENT
• A lot of real-world data consists of collections of objects
• User's session on a website (list of events)
• Products in a shopping cart (bag of items)
• Product titles (list of words)
• Songs played in a user's history (list of items)
• Movies liked in a user's signup flow (set of items)
LEARNING FROM ITEM COLLECTIONS
PROBLEM STATEMENT
• A lot of real-world data consists of collections of objects
• User's session on a website (list of events) → ORDERED
• Products in a shopping cart (bag of items) → ORDERED OR NOT
• Product titles (list of words) → ORDERED… OR NOT?
• Songs played in a user's history (list of items) → ORDERED
• Movies liked in a user's signup flow (set of items) → UNORDERED
LEARNING FROM ITEM COLLECTIONS
PROBLEM STATEMENT
• Learning representations for variable-length sequences is "easy"
• RNNs, LSTMs, GRUs
• Input = sequence of embeddings
• Output = embedding for whole sequence
• Very effective but not always the cheapest or easiest to train
• But what if the data is unordered?
• What if it's ordered, but that ordering is uninformative?
HOW CAN WE LEARN A SINGLE EMBEDDING FROM A BAG OR SET OF ITEM EMBEDDINGS?
(WHICH MIGHT NOT WORK VERY WELL)
REALLY SIMPLE APPROACH
• Learn item embeddings in an unsupervised manner
• e.g. "Item2Vec", Barkan & Koenigstein 2016
• word2vec (skip-gram with negative sampling) on item IDs
• Average them together to get an embedding for the set/bag (sketched below)
• Often used in text mining / IR as a baseline or lower bound
• e.g. "word centroid distance" from Kusner et al 2015
[Diagram: embeddings for Item 05, Item 17 and Item 23 combined by element-wise mean]
Issues:
• Not task oriented
• Embeddings can't adapt to problem domain
• No guarantee that taking the mean is the best strategy
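A minimal sketch of this baseline in Python, assuming gensim's word2vec implementation (gensim 3.x API); the basket data, dimensions and the set_embedding helper are illustrative, not from the talk:

    # Minimal sketch: Item2Vec-style embeddings, then an element-wise mean.
    # Assumes gensim 3.x (newer versions rename `size` to `vector_size`).
    import numpy as np
    from gensim.models import Word2Vec

    baskets = [["item_05", "item_17", "item_23"],   # each basket is a
               ["item_17", "item_42", "item_05"]]   # "sentence" of item IDs

    # sg=1 -> skip-gram, negative=5 -> negative sampling, as in Item2Vec
    model = Word2Vec(baskets, size=50, sg=1, negative=5, window=10, min_count=1)

    def set_embedding(items):
        # Element-wise mean of the item vectors: one vector per set/bag.
        vecs = [model.wv[i] for i in items if i in model.wv]
        return np.mean(vecs, axis=0)

    print(set_embedding(["item_05", "item_17", "item_23"]).shape)  # (50,)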
LEARN EMBEDDINGS WHILE TRAINING ON A TASK
NEURAL BAG-OF-ITEMS
• Common baseline in NLP tasks: neural bag-of-words
• Initialize embeddings randomly
• Or from unsupervised pre-training, or third-party data
• Take mean (or sometimes sum)
• Feed into network, update embeddings via backprop (sketched below)
[Diagram: item embeddings → element-wise mean → output layer or rest of network; errors propagate back into the embeddings]
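As a hedged illustration in Keras (the library used for the experiments later in this deck), a neural bag-of-items needs only a few lines; the vocabulary size, padded sequence length and sigmoid output here are assumptions:

    # Minimal sketch of a neural bag-of-items, not the talk's exact code.
    from keras.models import Sequential
    from keras.layers import Embedding, GlobalAveragePooling1D, Dense

    NUM_ITEMS, DIM, MAX_LEN = 50000, 50, 100  # assumed sizes

    model = Sequential([
        Embedding(NUM_ITEMS, DIM, input_length=MAX_LEN),  # trainable by default
        GlobalAveragePooling1D(),        # element-wise mean over the item axis
        Dense(1, activation='sigmoid'),  # output layer (task-dependent)
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    # model.fit(...) then updates the embedding table via backprop too.
    # Caveat: zero-padding is averaged in as well; a masked mean is a common fix.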
COMPOSE EMBEDDINGS VIA NON-LINEAR TRANSFORMATIONS
DEEP AVERAGING NETWORKS
• "Deep Unordered Composition Rivals Syntactic Methods for Text Classification" (Iyyer et al 2015)
• Developed for sentiment classification & question answering
• Proposed as a cheap alternative to recursive neural networks
• In a nutshell:
• Don't use mean of embeddings directly
• Take mean and pass it through some fully-connected layers (sketched below)
• Probably prior art somewhere?
[Diagram: item embeddings → element-wise mean → FC1 → FC2 → output layer or rest of network; the activation of the last FC layer is the representation of the whole set, and errors propagate back into the FC layers and embeddings]
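A sketch of a DAN in Keras under the same assumptions as before; the dim-50 embedding and two dim-50 ReLU layers mirror the experiment later in this deck, but treat this as an illustration rather than the exact insta-keras code:

    # Deep averaging network: mean first, then fully-connected layers.
    from keras.models import Sequential
    from keras.layers import Embedding, GlobalAveragePooling1D, Dense

    NUM_ITEMS, DIM, MAX_LEN = 50000, 50, 100  # assumed sizes

    dan = Sequential([
        Embedding(NUM_ITEMS, DIM, input_length=MAX_LEN),
        GlobalAveragePooling1D(),      # element-wise mean
        Dense(50, activation='relu'),  # FC1
        Dense(50, activation='relu'),  # FC2: its activation represents the set
        Dense(1),                      # linear output (regression here)
    ])
    dan.compile(optimizer='adam', loss='mse')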
"THE DEEP LAYERS OF THE DAN AMPLIFY TINY DIFFERENCES IN THE VECTOR AVERAGE THAT ARE PREDICTIVE OF THE OUTPUT LABELS." (Iyyer et al)
"I really loved Rosamund Pike's performance in the movie Gone Girl"
"I really liked Rosamund Pike's performance in the movie Gone Girl"
"I really despised Rosamund Pike's performance in the movie Gone Girl"
All three sentences have very similar vector means
REMOVING ENTIRE EMBEDDINGS FROM THE MEAN
WORD DROPOUT
• Additional contribution: alternative dropout scheme
• Don't add dropout after fully-connected layers
• Instead, randomly drop words from the input sentences (sketched below)
• Maybe somewhat specific to sentiment and question answering?
• Most words in a sentence don't affect the sentiment
• Most words in a sentence don't describe the actual answer
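One plausible way to implement word dropout, as an assumption rather than the paper's code: replace a random subset of token IDs with the padding ID each time a batch is drawn, so whole embeddings disappear from the input entirely:

    # Word dropout sketch: zero out entire tokens, not individual units.
    import numpy as np

    def word_dropout(item_ids, p=0.3, pad_id=0):
        ids = np.array(item_ids)               # copy; shape (batch, max_len)
        drop = np.random.rand(*ids.shape) < p  # Bernoulli mask per token
        ids[drop & (ids != pad_id)] = pad_id   # never "drop" existing padding
        return ids

    batch = np.array([[12, 7, 301, 0, 0]])
    print(word_dropout(batch, p=0.5))  # e.g. [[12  0 301  0  0]]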
DEEP AVERAGING NETWORKS FOR ECOMMERCE DATA
PREDICTING GROCERY RE-ORDERS
INSTACART KAGGLE CONTEST
Simplified version of task, for trying out DANs:
• Given previous order (n of ~50K products)…
• Predict what % of items in it will be re-ordered in next order (target sketched below)
• Use only the items in the previous order (not user, metadata etc.)
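The target itself is just a set overlap; a toy sketch, treating each order as a set of product IDs (the actual Instacart CSV wrangling is omitted):

    # Fraction of the previous order's items that reappear in the next order.
    def reorder_fraction(prev_order, next_order):
        if not prev_order:
            return 0.0
        return len(prev_order & next_order) / float(len(prev_order))

    print(reorder_fraction({1, 2, 3, 4}, {2, 4, 9}))  # 0.5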
TRAIN ON 2,893,386 SAMPLES, VALIDATE ON 321,488 SAMPLES
DAN VS GRU HEAD-TO-HEAD
DAN input: unordered item IDs
Dim-50 item embedding (2,484,450 trainable params)
Mean + 2x dim-50 dense ReLU layers (5,100 trainable params)
Single linear output (51 trainable params)

GRU input: ordered item IDs
Dim-50 item embedding (2,484,450 trainable params)
GRU with 25 units + ReLU activation (5,700 trainable params)
Single linear output (26 trainable params)
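For concreteness, a hedged Keras sketch of the GRU side (the DAN side matches the sketch shown earlier); 25 units over dim-50 embeddings reproduce the 5,700 and 26 parameter counts above, while the padding length is an assumption:

    # GRU baseline under the same assumptions as the DAN sketch.
    from keras.models import Sequential
    from keras.layers import Embedding, GRU, Dense

    NUM_ITEMS, DIM, MAX_LEN = 49689, 50, 100  # 49,689 x 50 = 2,484,450

    gru = Sequential([
        Embedding(NUM_ITEMS, DIM, input_length=MAX_LEN),
        GRU(25, activation='relu'),  # reads the ordered item sequence
        Dense(1),                    # linear output: predicted re-order %
    ])
    gru.compile(optimizer='adam', loss='mse')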
TRAINED WITH ADAM (ALL DEFAULTS) ON GOOGLE GPU BOX
DAN VS GRU HEAD-TO-HEAD
DAN
Batch size: 100
MSE loss
One epoch: 4 minutes
Mean training loss: 0.0631
Validation loss: 0.0626
Competitive result in minutes
GRU
Batch size: 100
MSE loss
One epoch: 5 hours
Mean training loss: 0.0626
Validation loss: 0.0614
Slightly better result… in hours!
DAN MATCHED GRU PERFORMANCE IN 12 MINUTES
DAN VS GRU HEAD-TO-HEAD
[Chart: DAN training and validation loss over epochs 1 to 5 (y-axis 0.0568 to 0.0640), with a reference line at 0.0615 ← GRU performance after 5 hours]
SOME REMARKS
DAN VS GRU HEAD-TO-HEAD
• Tried 'neural bag-of-items' (no hidden layers) for comparison
• Training time per epoch similar to DAN (few secs faster)
• Validation loss flattened out at 0.063 (worse than DAN at epoch 0)
• Not a thorough investigation: no hyperparameter search
• No dropout, weight decay, batch norm, etc.
• Item dropout (i.e. word dropout) didn't seem to help
• Unlike text mining tasks, all items in bag are (potentially) important
ANY QUESTIONS?
THANKS!
• Code available on GitHub: andrewclegg/insta-keras
• Feel free to grab me afterwards to chat about anything
• Or ping me on Twitter: @andrew_clegg