Slides for my Activate 2018 presentation on using Deep Learning in search, for two different topics: personalized search / recommendations, and then "learning to tokenize".
Deep Learning for Search: Personalization and Deep Tokenization
1. Deep Learning for Unified
Personalized Search
Recommendations
(and Fuzzy Tokenization)
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane | in/jakemannix
#Activate18 #ActivateSearch
2. $whoami
Now: Chief Data Engineer, Lucidworks
Applied ML / relevance / RecSys
data engineering
Previously:
Allen Institute for AI: semantic search over research publications
Twitter: account search, user interest modeling, RecSys
LinkedIn: profile search, generic entity-to-entity RecSys
Prehistory:
Other software dev.
Algebraic topology, particle cosmology
4. Search Relevance Feature Types
static document priors
query intent class labels
query entities
query / doc text similarity
personalization (p18n)
clickstream
(example Solr query which demonstrates all of these omitted because it doesn't fit on this slide)
5. Agenda: getting down to business
Personalized Search and the Clickstream
Deep Learning To Rank
Embeddings
Text encoding
p18n
clickstream
Objective functions
Distributed vs Local training
Query time inference
Deep Tokenization for Lucene
6. DL4IR: How I learned to stop worrying and
love deep neural networks
Non-reasons:
Always the best ranking results
c++/CUDA under the hood => superfast inference
default model works OOTB
My reasons, as a data engineer:
Extremely modular, unified framework
Easily updatable models
GPU => fewer distributed systems
Domain Knowledge + Feature Engineering => Naive Vectorization +
Network Architecture Engineering
7. DL4IR: Why?
Extremely modular, unified framework. DL models are:
dissectible: reusable sub-modules
composable: inputs to other models
Easily updatable models
ok, maybe not easy
(because transfer learning is hard)
GPU => fewer distributed systems
GPU=supercomputer, CUDA already written
Feature Engineering is not repeatable:
Architecture Engineering is (more or less)
in DL, features aren't free, but are learned
8. Agenda: Deep LTR
Deep Learning to Rank
Embeddings:
pre-trained
from scratch
fine tuned
Text encoding
P18n: userId embeddings
clickstream: docId embeddings
Objective functions
Distributed vs Local training
Query-time inference
9. Embeddings
Pre-trained text embeddings:
GloVe (https://nlp.stanford.edu/projects/glove/)
NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
fastText (https://fasttext.cc)
ELMo (https://tfhub.dev/google/elmo/2)
From scratch
Many parameters -> lots of training data
Can be unsupervised first, then treated as above
Fine-tuned
Start w/ pre-trained, w/ trainable=False
Train as usual, but not to convergence
Re-start training with trainable=True + a lower learning rate
10. Embeddings: keras code
Load pre-trained embeddings as a numpy array of dense vectors (indexed by token-id), then start building your model like so:
After training, the embedding will be saved with your model, and
you can also extract it out:
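A minimal Keras sketch of that pattern (the placeholder random matrix and layer sizes are mine; swap in real GloVe/fastText vectors):

import numpy as np
from tensorflow.keras import layers, models

# Placeholder for a real pre-trained matrix (e.g. GloVe), shape (vocab_size, embed_dim)
embedding_matrix = np.random.rand(50000, 100).astype("float32")
vocab_size, embed_dim = embedding_matrix.shape

token_ids = layers.Input(shape=(None,), dtype="int32")
emb = layers.Embedding(vocab_size, embed_dim,
                       weights=[embedding_matrix],
                       trainable=False)(token_ids)   # flip trainable=True later to fine-tune
encoded = layers.GlobalAveragePooling1D()(emb)
output = layers.Dense(1, activation="sigmoid")(encoded)
model = models.Model(token_ids, output)

# After training, the (possibly fine-tuned) embedding is saved with the model
# and can be extracted back out as a numpy array:
learned_embeddings = model.layers[1].get_weights()[0]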
11. Agenda
Deep Learning to Rank
Embeddings
Text encoding:
chars vs words
CNNs vs LSTMs
P18n: userId embeddings
clickstream: docId embeddings
Objective functions
Distributed vs Local training
Query-time inference
12. Text encoding
Characters vs Words:
word embeddings require lots of data
Millions of parameters => many GB of training data
needs good tokenization + preprocessing
(same in data sci pipeline / at query time!)
Try char sequences instead!
sometimes works for old ML
works on small data
on raw byte streams (no tokenizers)
not my clever trick (cf. Zhang, Zhao, LeCun '15)
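A tiny sketch of the char-sequence alternative (function name and max_len are mine): map raw bytes straight to integer ids, with no tokenizer or preprocessing in the way.

import numpy as np

def encode_chars(text, max_len=256):
    # Raw UTF-8 bytes -> ids 1..256, with 0 reserved for padding; no tokenizer involved.
    ids = [b + 1 for b in text.encode("utf-8")[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)), dtype="int32")

batch = np.stack([encode_chars(q) for q in ["lord of the rinfs", "55 nch tv"]])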
13. 1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates state as it reads, can emit
sequence of states at each position as input for another LSTM:
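A hedged Keras sketch of both encoders over the same char-id input (all sizes are illustrative, not the ones used in the talk):

from tensorflow.keras import layers

char_ids = layers.Input(shape=(256,), dtype="int32")
x = layers.Embedding(257, 16)(char_ids)            # 256 byte values + padding id

# 1d-CNN encoder: convolve along the character axis, then pool.
cnn = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
cnn = layers.GlobalMaxPooling1D()(cnn)

# LSTM encoder: return_sequences=True emits a state at each position,
# which can feed another (stacked) LSTM.
lstm = layers.LSTM(64, return_sequences=True)(x)
lstm = layers.LSTM(64)(lstm)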
14. LSTMs are better, but I ♥ CNNs
LSTMs for text:
A little harder to understand (boo!)
(black box)-ish, not much to dissect (yay/boo?)
Many parameters, needs big data (boo!)
Not GPU-friendly -> slow to train (boo!)
Often works OOTB w/ no tuning (yay!)
Typically SOTA quality after significant tuning (yay!)
CNNs for text:
Fairly simple to understand (yay!)
Easily dissectible (yay!)
Few parameters, requires less training data (yay!)
GPU-friendly -> super fast to train (yay!)
Many many hyperparameters -> hard to tune (boo!)
Currently not SOTA (boo!) but aren't far off (yay!)
Typically requires more code (boo!)
17. p18n features
Deep Learning to Rank
Embeddings
Text encoding
p18n: userId embeddings
pre-trained RecSys (ALS) model
from scratch w/ hashing trick
clickstream: docId embeddings
objective functions
Distributed vs Local training
Query-time inference
18. p18n: pre-trained embeddings vs hashing trick
ALS matrix decomposition as pre-trained embedding
from collaborative filtering:
or: just hash UIDs to O(1k) dim (4x: avoid total
collisions) and learn an O(1k) x O(100) embedding for
them
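A minimal sketch of the hashing-trick variant (bucket count, dimensions, and names are illustrative placeholders):

import hashlib
from tensorflow.keras import layers

NUM_BUCKETS = 4096     # O(1k) buckets, sized ~4x to avoid total collisions
USER_DIM = 128         # O(100)-dim learned user embedding

def user_bucket(user_id: str) -> int:
    # Stable hash of the raw userId string into a bucket id.
    return int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS

bucket = user_bucket("u_12345")                            # done at feature-prep time
user_in = layers.Input(shape=(1,), dtype="int32")          # pre-hashed bucket id
user_vec = layers.Flatten()(layers.Embedding(NUM_BUCKETS, USER_DIM)(user_in))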
19. Clickstream features
Deep Learning to Rank
Embeddings
Text encoding
p18n: userId embeddings
clickstream: docId embeddings
same as for userId!
can overfit easily
memorizing query/doc history
(which is sometimes ok)
Objective functions
Distributed vs Local training
Query-time inference
22. Agenda
Deep Learning to Rank
Embeddings
Text encoding
p18n: userId embeddings
clickstream: docId embeddings
Objective functions:
Sentiment
Text classification
Text generation
Identity function
Ranking
Distributed vs Local training
Query-time inference
23. non-classification objectives
Text generation: Neural Network Language Models (NNLM)
Predict the next character/word from text
Identity function: Autoencoder
Predict the input as output
Search Ranking: score(query, doc)
query -click-> doc => score = 1
query -no-click-> doc => score = 0
better w/ triplets + curriculum learning:
Start with random no-click pairs
Later, pick docs Solr returns for query
(but got no clicks!)
eventually: docs w/ fewer clicks than expected
(known as hard negative mining)
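A hedged sketch of the pointwise click objective wired up in Keras (input dims are placeholders; the triplet/curriculum variants above only change how negatives are sampled, not this wiring):

from tensorflow.keras import layers, models

# Assume the query, doc, and user encoders above each yield a fixed-size vector.
query_vec = layers.Input(shape=(64,))
doc_vec   = layers.Input(shape=(64,))
user_vec  = layers.Input(shape=(128,))

h = layers.concatenate([query_vec, doc_vec, user_vec])
h = layers.Dense(128, activation="relu")(h)
score = layers.Dense(1, activation="sigmoid")(h)           # score(query, doc) in [0, 1]

ranker = models.Model([query_vec, doc_vec, user_vec], score)
ranker.compile(optimizer="adam", loss="binary_crossentropy")
# Labels: 1 for query -click-> doc, 0 for sampled no-click pairs
# (random at first, later Solr-returned-but-unclicked docs).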
24. Agenda
Deep Learning to Rank
Embeddings
Text encoding
p18n
clickstream
Distributed vs Local training
Query-time inference
25. Agenda
Deep Learning to Rank
Embeddings
Text encoding
p18n
clickstream
Distributed vs Local training
Query-time inference
Ideally: minimal pre/post-processing
beware of finicky tensor mappings!
jvm: MLeap TF support
29. Agenda
Personalized Search and the Clickstream
Deep Learning to Rank
Deep Tokens for Lucene
char-CNN internals
LSH for discretization
Hierarchical semantic tokenization
30. Deep Tokens
What does a 1d-CNN consume/emit?
Consumes a sequence (length n) of k-dim vectors
Emits a sequence (length n) of f-dim vectors
(assuming sequences are pre+post-padded)
If a CNN layer's windows are w wide, it requires:
w*k*f parameters (plus biases)
Activations are often ReLU: >= 0 w/lots of 0s
31. Deep Tokens: intermediate layers
1d-CNN feature-vectors
Consumes a sequence (length n) of k-dim vectors
Emits a sequence (length n) of f-dim vectors
(assuming sequences are pre+post-padded)
If a CNN layer's windows are w wide, it requires:
w*k*f parameters (plus biases)
Activations are often ReLU: >= 0 w/lots of 0s
How to get this data?
activs = [enc.layers[3].output, enc.layers[5].output]
extractor = Model(inputs=enc.inputs, outputs=activs)
32. 1d-char CNN feature vectors by layer
layer 0:
Learns simple features like word suffixes, simple morphology, spacing, etc
layer 1:
slightly more complex features like word roots, articles, pronouns, etc
layer 2:
complex features: words + common misspellings, hyphenations/concatenations
layer n:
Every time you pool + stride over the previous layer, the effective window grows by a factor of pool_size
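A back-of-the-envelope sketch of that growth, assuming conv windows of width 3 and pool_size 2 per block (my numbers, not the talk's):

def effective_window(blocks, conv_w=3, pool_size=2):
    # Receptive field in characters after `blocks` x (conv + pool) layers.
    window, stride = 1, 1
    for _ in range(blocks):
        window += (conv_w - 1) * stride      # conv widens by (w-1) * current stride product
        window += (pool_size - 1) * stride   # pooling widens a bit more...
        stride *= pool_size                  # ...and multiplies the stride product going up
    return window

print([effective_window(b) for b in (1, 2, 3, 4)])   # [4, 10, 22, 46] chars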
33. How deep can a char-CNN go?!?
Very Deep Convolutional Networks for Text Classification,
Conneau, Schwenk, LeCun, Barrault; '17
very small (3-char) windows, low filter count (64) early on
temporal version of VGG architecture
29 layers, input as long as 1k chars
Trained on 100k-3M docs
2.5 days on single GPU
(I don't know if this works for ranking)
34. Locality Sensitive Hash to int codes
dense vector becomes 16-24 bit int
text => List[Int] at each layer
Layer 0: same length as input
Layer N+1 after k-pooling: len(layer_n.output)/k
Indexing List[Int] is easy!
makes sense to an inverted index
Query time
Query => List[Int] per layer
search as usual (with sparsity!)
What can we do with these vectors?
35. LSH in 30 seconds:
Random projections approximately preserve distances (Johnson-Lindenstrauss lemma)
Can pick totally random vectors
Or: take a random sample of 2K vectors from your dataset and project via p_i = v_i - v_{i+1}
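A minimal sign-random-projection sketch (the simplest LSH family; function name and bit count are mine):

import numpy as np

def lsh_codes(vectors, num_bits=16, seed=0):
    # vectors: (seq_len, f) activations from one CNN layer -> one num_bits-bit int per position.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], num_bits))   # random hyperplanes
    bits = (vectors @ planes) > 0                                # which side of each plane
    return bits @ (1 << np.arange(num_bits))                     # pack the sign bits into an int

codes = lsh_codes(np.random.randn(40, 256))    # 40 positions -> 40 integer "deep tokens"

In practice you would fix one set of planes per layer and persist them alongside the index, since changing them is like changing an Analyzer (see the challenges slide).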
36. Deep Tokens: sample similar char-ngrams
Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle)
64-256 feature maps
quasi-hard negative mining by taking docs returned by Solr but with no clicks
Example ngrams similar at layer 3-ish or so:
similar: rin, e ri, rinf
From: lord of the ring, LOTR extended edition dvd, lord of the rinfs extended
and:
0 in, 0in , nch, inch
From: 70 inch lcd, 55 nch tv, 90in sony tv
and:
s z 8, zs8 , sz8 , lumix
From: panasonic lumix s z 8, lumix zs8, panasonic dmc-zs8s
longer strings become similar ~2 layers deeper:
10.1inches, lnch, inchplasma, inch
Still to do: full measurement of the full DL ranker vs. approximate multilayer search on these
tokens, while sweeping the hyperparameter space and hashing strategies
37. Deep tokens: challenges
Stability:
Once model + LSH family is chosen, this is like choosing an Analyzer - changing requires
full reindex
Hash functions which are optimal for one data set may be bad after indexing much more
data
Similarity on differing scales with same semantics
e.g. 55in and fifty five inch
(shortcut CNN connections needed?)
Stop words
want: no hash bucket (i.e. posting list) at any level to have > 10% of the corpus
Noisy tokens at earlier levels (maybe never index first 3?)
More generally
precision vs. recall tradeoff tuning
38. Related work: Xu, et al, CNNs for Text Hashing (IJCAI 15)
and many more (but none with as fun an acronym)
39. Deep Tokens: TL;DR
Configure a model w/ a deep char-CNN-based ranker and a search-relevance loss
Train it as usual
Configure a convolutional feature extractor (CFE)
From documents:
Extract convolutional activations
(learned textual features!)
LSH -> discrete buckets (abstract tokens)
Index these tokens
At query time, use this CFE for:
posting-list friendly deeply fuzzy search!
(because really, we just have a very fancy tokenizer)
N.B. char-CNN models are small (O(100-300k) params)
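A hedged sketch tying the earlier pieces together (extractor from slide 31, encode_chars and lsh_codes from the sketches above; the "layerN_code" token format is made up):

import numpy as np

def deep_tokens(text):
    # text -> char ids -> per-layer activations -> per-layer LSH codes -> index-ready tokens.
    char_ids = encode_chars(text)[np.newaxis, :]
    per_layer_activations = extractor.predict(char_ids)       # one array per tapped layer
    tokens = []
    for depth, acts in enumerate(per_layer_activations):
        tokens += [f"layer{depth}_{code}" for code in lsh_codes(acts[0])]
    return tokens    # index these like ordinary terms; run the same function on queries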