Deep Learning for Unified
Personalized Search
Recommendations
(and Fuzzy Tokenization)
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane | in/jakemannix
#Activate18 #ActivateSearch
$whoami
• Now: Chief Data Engineer, Lucidworks
  • Applied ML / relevance / RecSys
  • data engineering
• Previously:
  • Allen Institute for AI: semantic search over research publications
  • Twitter: account search, user interest modeling, RecSys
  • LinkedIn: profile search, generic entity-to-entity RecSys
• Prehistory:
  • Other software dev.
  • Algebraic topology, particle cosmology
Agenda
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Deep Tokenization for Lucene
Search Relevance Feature Types
• static document priors
• query intent class labels
• query entities
• query / doc text similarity
• personalization (p18n)
• clickstream
• (example Solr query which demonstrates all of these omitted because it doesn't fit on this slide)
Agenda: getting down to business
• Personalized Search and the Clickstream
• Deep Learning To Rank
  • Embeddings
  • Text encoding
  • p18n
  • clickstream
  • Objective functions
  • Distributed vs Local training
  • Query-time inference
• Deep Tokenization for Lucene
DL4IR: How I learned to stop worrying and
love deep neural networks
• Non-reasons:
  • Always the best ranking results
  • C++/CUDA under the hood => superfast inference
  • default model works OOTB
• My reasons, as a data engineer:
  • Extremely modular, unified framework
  • Easily updatable models
  • GPU => fewer distributed systems
  • Domain Knowledge + Feature Engineering => Naive Vectorization + Network Architecture Engineering
DL4IR: Why?
• Extremely modular, unified framework. DL models are:
  • dissectible: reusable sub-modules
  • composable: inputs to other models
• Easily updatable models
  • ok, maybe not easy
  • (because transfer learning is hard)
• GPU => fewer distributed systems
  • GPU = supercomputer, CUDA already written
• Feature Engineering is not repeatable:
  • Architecture Engineering is (more or less)
  • in DL, features aren't free, but are learned
Agenda: Deep LTR
• Deep Learning to Rank
  • Embeddings:
    • pre-trained
    • from scratch
    • fine-tuned
  • Text encoding
  • p18n: userId embeddings
  • clickstream: docId embeddings
  • Objective functions
  • Distributed vs Local training
  • Query-time inference
Embeddings
• Pre-trained text embeddings:
  • GloVe (https://nlp.stanford.edu/projects/glove/)
  • NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
  • fastText (https://fasttext.cc)
  • ELMo (https://tfhub.dev/google/elmo/2)
• From scratch
  • Many parameters -> lots of training data
  • Can be unsupervised first, then treated as above
• Fine-tuned
  • Start w/ pre-trained, w/ trainable=False
  • Train as usual, but not to convergence
  • Re-start training with trainable=True + a lower learning rate
Embeddings: keras code
Given pre-trained embeddings as a numpy array of dense vectors (indexed by token-id), just start building your model like so:
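(The slide's code was shown as an image; here is a minimal Keras sketch of that pattern. The file name, vocabulary size, and layer name are illustrative assumptions.)

```python
# Minimal sketch: seed an Embedding layer with a pre-trained matrix.
import numpy as np
from tensorflow.keras.layers import Input, Embedding

pretrained = np.load("pretrained_vectors.npy")   # hypothetical file, shape (vocab_size, embed_dim)
vocab_size, embed_dim = pretrained.shape

token_ids = Input(shape=(None,), dtype="int32", name="token_ids")
embedded = Embedding(vocab_size, embed_dim,
                     weights=[pretrained],
                     trainable=False,            # flip to True later when fine-tuning
                     name="text_embedding")(token_ids)
# ... build the rest of the ranking model on top of `embedded` ...
```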
After training, the embedding will be saved with your model, and
you can also extract it out:
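(Another sketch, assuming the full model built on top of `embedded` above was compiled as `model`; the layer name is the one assumed in the previous snippet.)

```python
# After model.fit(...), pull the (possibly fine-tuned) embedding matrix back out:
learned_vectors = model.get_layer("text_embedding").get_weights()[0]
np.save("learned_vectors.npy", learned_vectors)  # reuse elsewhere, e.g. nearest-neighbor lookups
```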
Agenda
• Deep Learning to Rank
  • Embeddings
  • Text encoding:
    • chars vs words
    • CNNs vs LSTMs
  • p18n: userId embeddings
  • clickstream: docId embeddings
  • Objective functions
  • Distributed vs Local training
  • Query-time inference
Text encoding
• Characters vs Words:
  • word embeddings require lots of data
    • Millions of parameters => many GB of training data
  • need good tokenization + preprocessing
    • (same in the data-sci pipeline / at query time!)
  • Try char sequences instead!
    • sometimes works for old-school ML too
    • works on small data
    • and on raw byte streams (no tokenizers; see the sketch below)
    • not my clever trick (cf. Zhang, Zhao, LeCun '15)
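(A tiny sketch of tokenizer-free vectorization; the fixed length and zero padding are arbitrary choices for illustration.)

```python
# Raw bytes as the "vocabulary": no analyzer, no tokenizer, just UTF-8 byte values 0-255.
def to_byte_ids(text: str, max_len: int = 128) -> list:
    ids = list(text.encode("utf-8"))[:max_len]
    return ids + [0] * (max_len - len(ids))   # right-pad to a fixed length

print(to_byte_ids("70 inch lcd")[:12])        # [55, 48, 32, 105, 110, 99, 104, 32, 108, 99, 100, 0]
```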
1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates state as it reads, can emit
sequence of states at each position as input for another LSTM:
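(A tiny Keras sketch of that stacking idea; vocabulary and unit sizes are illustrative.)

```python
# return_sequences=True makes the LSTM emit its state at every position, so a
# second LSTM (or any other sequence consumer) can read the whole sequence of states.
from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

tokens = Input(shape=(None,), dtype="int32", name="tokens")
x = Embedding(50000, 128)(tokens)
x = LSTM(256, return_sequences=True)(x)   # one 256-dim state per position
x = LSTM(256)(x)                          # final state summarizes the sequence
lstm_encoder = Model(inputs=tokens, outputs=x)
```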
LSTMs are better, but I ♥ CNNs
• LSTMs for text:
  • A little harder to understand (boo!)
  • (black box)-ish, not much to dissect (yay/boo?)
  • Many parameters, need big data (boo!)
  • Not GPU-friendly -> slow to train (boo!)
  • Often work OOTB w/ no tuning (yay!)
  • Typically SOTA quality after significant tuning (yay!)
• CNNs for text:
  • Fairly simple to understand (yay!)
  • Easily dissectible (yay!)
  • Few parameters, require less training data (yay!)
  • GPU-friendly -> super fast to train (yay!)
  • Many, many hyperparameters -> hard to tune (boo!)
  • Currently not SOTA (boo!) but aren't far off (yay!)
  • Typically require more code (boo!)
1D CNN text encoder: keras code
1D CNN text encoder: layer shapes and sizes
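(These two slides showed the code and layer summary as images; below is a hedged reconstruction of a plausible character-level 1D-CNN encoder. Filter counts, window widths, pooling, and layer names are illustrative, not the talk's exact model.)

```python
# Sketch of a char-level 1D-CNN text encoder in Keras.
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     MaxPooling1D, GlobalMaxPooling1D)
from tensorflow.keras.models import Model

max_chars, char_vocab = 128, 70                         # assumed input limits
chars = Input(shape=(max_chars,), dtype="int32", name="chars")
x = Embedding(char_vocab, 16, name="char_embedding")(chars)
x = Conv1D(64, 3, padding="same", activation="relu", name="conv_0")(x)
x = MaxPooling1D(pool_size=2, name="pool_0")(x)         # effective window grows with each pool
x = Conv1D(128, 3, padding="same", activation="relu", name="conv_1")(x)
x = MaxPooling1D(pool_size=2, name="pool_1")(x)
x = Conv1D(256, 3, padding="same", activation="relu", name="conv_2")(x)
encoded = GlobalMaxPooling1D(name="text_vector")(x)

encoder = Model(inputs=chars, outputs=encoded, name="char_cnn_encoder")
encoder.summary()   # prints the per-layer output shapes and parameter counts
```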
p18n features
• Deep Learning to Rank
  • Embeddings
  • Text encoding
  • p18n: userId embeddings
    • pre-trained RecSys (ALS) model
    • from scratch w/ hashing trick
  • clickstream: docId embeddings
  • Objective functions
  • Distributed vs Local training
  • Query-time inference
p18n: pre-trained embeddings vs hashing trick
Either use an ALS matrix decomposition from collaborative filtering as a pre-trained embedding,
or: just hash UIDs into O(1k) buckets (with 4 hash functions, to avoid total collisions) and learn an O(1k) x O(100) embedding for them (sketch below).
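(A minimal sketch of the hashing-trick option; the bucket count, number of hashes, embedding size, and the salted-crc32 scheme are all assumptions for illustration.)

```python
# Hash each userId with a few salts into a fixed bucket space, embed each bucket, average.
import zlib
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Lambda

NUM_BUCKETS, NUM_HASHES, USER_DIM = 1024, 4, 64

def hash_user_id(uid: str) -> list:
    # multiple salted hashes: two users rarely collide in *every* bucket
    return [zlib.crc32(f"{salt}:{uid}".encode()) % NUM_BUCKETS
            for salt in range(NUM_HASHES)]

user_buckets = Input(shape=(NUM_HASHES,), dtype="int32", name="user_buckets")
user_embed = Embedding(NUM_BUCKETS, USER_DIM, name="user_embedding")(user_buckets)
user_vector = Lambda(lambda t: tf.reduce_mean(t, axis=1),
                     name="user_vector")(user_embed)   # shape: (batch, USER_DIM)
# hash_user_id("u:12345") yields NUM_HASHES bucket ids to feed in as `user_buckets`
```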
Clickstream features
• Deep Learning to Rank
  • Embeddings
  • Text encoding
  • p18n: userId embeddings
  • clickstream: docId embeddings
    • same as for userId!
    • can overfit easily
      • memorizing query/doc history
      • (which is sometimes ok)
  • Objective functions
  • Distributed vs Local training
  • Query-time inference
All together now: p18n query/doc CNN ranker
Picture > 1k words
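(The slide itself was a diagram; here is a hedged sketch of what such a combined model could look like in Keras, reusing the char-CNN `encoder` sketched earlier. All other names, bucket counts, and dimensions are assumptions.)

```python
# Query/doc text through a shared char-CNN, plus userId and docId embeddings,
# concatenated into a dense head that predicts click probability.
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

query_chars = Input(shape=(128,), dtype="int32", name="query_chars")
doc_chars   = Input(shape=(128,), dtype="int32", name="doc_chars")
user_bucket = Input(shape=(1,), dtype="int32", name="user_bucket")
doc_bucket  = Input(shape=(1,), dtype="int32", name="doc_bucket")

q_vec = encoder(query_chars)          # `encoder`: the char-CNN from the earlier sketch
d_vec = encoder(doc_chars)            # shared weights between query and doc text
u_vec = Flatten()(Embedding(1024, 64, name="user_embedding")(user_bucket))
c_vec = Flatten()(Embedding(4096, 64, name="doc_embedding")(doc_bucket))

joined = Concatenate()([q_vec, d_vec, u_vec, c_vec])
hidden = Dense(128, activation="relu")(joined)
score  = Dense(1, activation="sigmoid", name="click_probability")(hidden)

ranker = Model(inputs=[query_chars, doc_chars, user_bucket, doc_bucket], outputs=score)
ranker.compile(optimizer="adam", loss="binary_crossentropy")
```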
Agenda
• Deep Learning to Rank
  • Embeddings
  • Text encoding
  • p18n: userId embeddings
  • clickstream: docId embeddings
  • Objective functions:
    • Sentiment
    • Text classification
    • Text generation
    • Identity function
    • Ranking
  • Distributed vs Local training
  • Query-time inference
non-classification objectives
• Text generation: Neural Network Language Models (NNLM)
  • Predict the next character/word from the text so far
• Identity function: Autoencoder
  • Predict the input as the output
• Search Ranking: score(query, doc)
  • query -click-> doc => score = 1
  • query -no-click-> doc => score = 0
  • better w/ triplets + curriculum learning (sketch below):
    • Start with random no-click pairs
    • Later, pick docs Solr returns for the query (but which got no clicks!)
    • eventually: docs w/ fewer clicks than expected (known as hard negative mining)
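(A minimal sketch of such a triplet ranking objective; the margin, cosine scoring, and vector size are illustrative choices, and the query/doc vectors come from whatever encoders you have built, not necessarily the talk's exact model.)

```python
# Hinge triplet loss over (query, clicked doc, unclicked doc) score pairs.
import tensorflow as tf
from tensorflow.keras.layers import Input, Dot, Concatenate
from tensorflow.keras.models import Model

EMB = 128
q_vec   = Input(shape=(EMB,), name="query_vector")
pos_vec = Input(shape=(EMB,), name="clicked_doc_vector")
neg_vec = Input(shape=(EMB,), name="unclicked_doc_vector")

pos_score = Dot(axes=-1, normalize=True)([q_vec, pos_vec])   # cosine(query, clicked doc)
neg_score = Dot(axes=-1, normalize=True)([q_vec, neg_vec])   # cosine(query, unclicked doc)
scores = Concatenate()([pos_score, neg_score])

def hinge_triplet_loss(margin=0.2):
    # push the clicked doc above the unclicked doc by at least `margin`
    def loss(_unused_y_true, y_pred):
        pos, neg = y_pred[:, 0], y_pred[:, 1]
        return tf.reduce_mean(tf.maximum(0.0, margin - pos + neg))
    return loss

triplet_ranker = Model(inputs=[q_vec, pos_vec, neg_vec], outputs=scores)
triplet_ranker.compile(optimizer="adam", loss=hinge_triplet_loss())
# fit with a dummy y_true (e.g. zeros); the loss only looks at the two scores
```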
Agenda
• Deep Learning to Rank
  • Embeddings
  • Text encoding
  • p18n
  • clickstream
  • Distributed vs Local training
  • Query-time inference
Agenda
• Deep Learning to Rank
  • Embeddings
  • Text encoding
  • p18n
  • clickstream
  • Distributed vs Local training
  • Query-time inference
    • Ideally: minimal pre/post-processing
    • beware of finicky tensor mappings!
    • jvm: MLeap TF support
want: simple model serving config:
MLeap source: TF integration
http://mleap-docs.combust.ml/
(also supports SparkML, sklearn, xgboost, etc.)
(and now for something completely different)
Agenda
• Personalized Search and the Clickstream
• Deep Learning to Rank
• Deep Tokens for Lucene
  • char-CNN internals
  • LSH for discretization
  • Hierarchical semantic tokenization
Deep Tokens
• What does a 1d-CNN consume/emit?
  • Consumes a sequence (length n) of k-dim vectors
  • Emits a sequence (length n) of f-dim vectors
    • (assuming sequences are pre+post-padded)
  • If a CNN layer's windows are w wide, it requires:
    • w*k*f parameters (plus biases)
  • Activations are often ReLU: >= 0, w/ lots of 0s
Deep Tokens: intermediate layers
• 1d-CNN feature-vectors
  • Consumes a sequence (length n) of k-dim vectors
  • Emits a sequence (length n) of f-dim vectors
    • (assuming sequences are pre+post-padded)
  • If a CNN layer's windows are w wide, it requires:
    • w*k*f parameters (plus biases)
  • Activations are often ReLU: >= 0, w/ lots of 0s
• How to get this data?
  • activs = [enc.layers[3].output, enc.layers[5].output]
  • extractor = Model(inputs=enc.inputs, outputs=activs)
1d-char CNN feature vectors by layer
• layer 0:
  • Learns simple features like word suffixes, simple morphology, spacing, etc.
• layer 1:
  • slightly more complex features like word roots, articles, pronouns, etc.
• layer 2:
  • complex features: words + common misspellings, hyphenations/concatenations
• layer n:
  • Every time you pool + stride over the previous layer, the effective window grows by a factor of pool_size
How deep can a char-CNN go?!?
• "Very Deep Convolutional Networks for Text Classification", Conneau, Schwenk, LeCun, Barrault; '17
  • very small (3-char) windows, low filter count (64) early on
  • temporal version of the VGG architecture
  • 29 layers, inputs as long as 1k chars
  • Trained on 100k-3M docs
    • 2.5 days on a single GPU
  • (I don't know if this works for ranking)
• Locality Sensitive Hash to int codes
  • dense vector becomes a 16-24 bit int
• text => List[Int] at each layer
  • Layer 0: same length as the input
  • Layer N+1 after k-pooling: len(layer_n.output)/k
• Indexing List[Int] is easy!
  • makes sense to an inverted index
• Query time
  • Query => List[Int] per layer
  • search as usual (with sparsity!)
What can we do with these vectors?
LSH in 30 seconds:
• Random projections preserve distances, thanks to the Johnson-Lindenstrauss lemma
• Can pick totally random vectors
• Or: take a random sample of 2K vectors from your dataset and project via p_i = v_i - v_{i+1} (see the sketch below)
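(A small sign-random-projection sketch of turning one layer's activation vectors into int codes; the bit width and hashing scheme here are assumptions, not necessarily what the talk used.)

```python
# LSH via random hyperplanes: pack the sign pattern of random projections into an int.
import numpy as np

def make_lsh(feature_dim: int, n_bits: int = 16, seed: int = 0):
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, feature_dim))   # fixed once, like an Analyzer
    def hash_vector(v: np.ndarray) -> int:
        bits = (planes @ v) > 0.0
        return int(bits.dot(1 << np.arange(n_bits)))       # pack the sign bits into an int code
    return hash_vector

# activations: (sequence_length, feature_dim) output from one conv layer of the extractor
def deep_tokens(activations: np.ndarray, hash_vector) -> list:
    return [hash_vector(v) for v in activations]            # one "deep token" per position
```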
Deep Tokens: sample similar char-ngrams
• Trained a 7-layer char-CNN ranker on 3M BestBuy e-commerce clicks (from Kaggle)
  • 64-256 feature maps
  • quasi-hard negative mining by taking docs returned by Solr but with no clicks
• Example ngrams similar at layer 3-ish or so:
  • similar: "rin", "e ri", "rinf"
    • From: "lord of the ring", "LOTR extended edition dvd", "lord of the rinfs extended"
  • and: "0 in", "0in ", " nch", "inch"
    • From: "70 inch lcd", "55 nch tv", "90in sony tv"
  • and: "s z 8", " zs8 ", " sz8 ", "lumix"
    • From: "panasonic lumix s z 8", "lumix zs8", "panasonic dmc-zs8s"
• longer strings are similar a couple of layers deeper:
  • "10.1inches", "lnch", "inchplasma", "inch"
• Still to do: full measurement of full DL ranking vs. approximate multilayer search on these tokens, while sweeping the hyperparameter space and hashing strategies
Deep tokens: challenges
• Stability:
  • Once the model + LSH family is chosen, this is like choosing an Analyzer: changing it requires a full reindex
  • Hash functions which are optimal for one data set may be bad after indexing much more data
• Similarity on differing scales with the same semantics
  • e.g. "55in" and "fifty five inch"
  • (shortcut CNN connections needed?)
• Stop words
  • want: no hash bucket (i.e. posting list) at any level has > 10% of the corpus
  • Noisy tokens at earlier levels (maybe never index the first 3?)
• More generally
  • precision vs. recall tradeoff tuning
Related work: Xu et al., "CNNs for Text Hashing" (IJCAI '15)
and many more (but none with as fun an acronym)
Deep Tokens: TL;DR
• Configure a model w/ a deep char-CNN-based ranker w/ a search-relevance loss
• Train it as usual
• Configure a convolutional feature extractor (CFE)
• From documents:
  • Extract convolutional activations
  • (learned textual features!)
  • LSH -> discrete buckets (abstract tokens)
  • Index these tokens
• At query time, use this CFE for:
  • posting-list-friendly deeply fuzzy search!
  • (because really, it's just a very fancy tokenizer)
• N.B. char-CNN models are small (O(100-300k) params)
• (end-to-end sketch below)
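(Putting the earlier illustrative sketches together; `extractor`, `make_lsh`, and `deep_tokens` refer to those snippets, and the layer names, dimensions, and padding are assumptions.)

```python
# End-to-end "fancy tokenizer": text -> char ids -> conv activations -> LSH -> deep tokens.
import numpy as np

# one LSH family per extracted conv layer; feature_dim must match that layer's filter count
hashers = {"conv_1": make_lsh(feature_dim=128, seed=1),
           "conv_2": make_lsh(feature_dim=256, seed=2)}

def analyze(text: str, char_index: dict) -> dict:
    """Turn raw text into per-layer lists of deep tokens ready for indexing."""
    ids = np.array([[char_index.get(c, 0) for c in text.lower()[:128]]])
    ids = np.pad(ids, ((0, 0), (0, 128 - ids.shape[1])))       # pad to the fixed input length
    activations = extractor.predict(ids)                        # one array per extracted layer
    return {name: deep_tokens(acts[0], hashers[name])
            for name, acts in zip(hashers, activations)}

# index each returned List[Int] into its own field; at query time, run the same
# analyze() on the query text and search those fields as usual
```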
Thank you!
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane
#Activate18 #ActivateSearch
References:
• Coming soon