Automatic Search Event by Automatic Keyword Extraction
Xiwei Yan
08-10-2016
Overview
Ads landing pages → HTML source code → Text → Keywords & key phrases → Similar webpages → Audience
Motivation
• Automate the search events (free BAs from manually generating the keywords)
• Identify users for campaigns that don't have pixels
A First Glimpse at Result
Approach
• Preprocessing
• Keyword extraction models
  • TF-IDF
  • TextRank
  • Word2Vec + TextRank
  • TextRank + Word2Vec
Approach 1 - TFIDF
• Preprocessing
  • Lowercase, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags
• Keyword extraction models
  • TF-IDF
    • TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)
    • TF(w, d) = number of times word w occurs in doc d
    • IDF(n, N) = log(N / n), where n is the number of docs in which word w appears and N is the total number of docs

Word        Term freq in doc1   Appears in # docs   TF-IDF
car         27                  3                   0
auto        3                   2                   1.216
Insurance   0                   2                   0
Best        14                  2                   5.676
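The table values can be reproduced with a few lines of Python. This is a minimal sketch, not the author's code: it assumes raw term frequency, a natural-log IDF, and a corpus of 3 documents. The corpus size is not stated on the slide; it is inferred because "car" appears in 3 docs and scores 0.

```python
import math

def tf_idf(term_freq: int, docs_with_term: int, total_docs: int) -> float:
    """TF-IDF with raw term frequency and natural-log IDF."""
    return term_freq * math.log(total_docs / docs_with_term)

# (term frequency in doc1, number of docs containing the word), from the table
table = {"car": (27, 3), "auto": (3, 2), "Insurance": (0, 2), "Best": (14, 2)}
TOTAL_DOCS = 3  # assumption: inferred from the table, not stated on the slide

scores = {word: tf_idf(tf, n, TOTAL_DOCS) for word, (tf, n) in table.items()}
for word, score in scores.items():
    print(f"{word:10s} {score:.4f}")
```

A word that appears in every document ("car") gets an IDF of log(1) = 0, so its high term frequency does not help it.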
Approach 2 - TextRank
• Preprocessing
  • Lowercase, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags
• Identify structurally important keywords
• Iteratively calculate:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

d is the damping factor, usually set to 0.85.
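The update rule can be sketched in plain Python as follows. The graph is an assumption: its edges are reconstructed from the neighbor labels and node degrees in the geico example of this deck, and the `car` and `coverage` edges are inferred by elimination. The function names are illustrative, not the author's.

```python
DAMPING = 0.85

# Undirected co-occurrence edges. Assumption: reconstructed from the neighbor
# labels and node degrees of the geico example; not given explicitly.
EDGES = [
    ("geico", "service"), ("geico", "call"), ("geico", "auto"),
    ("geico", "insurance"), ("geico", "policy"),
    ("policy", "find"), ("policy", "privacy"), ("policy", "insurance"),
    ("insurance", "auto"), ("insurance", "car"), ("insurance", "coverage"),
]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def textrank_step(graph, scores, d=DAMPING):
    """One synchronous update of S(V_i) for every node."""
    return {
        node: (1 - d) + d * sum(scores[nb] / len(graph[nb]) for nb in neighbors)
        for node, neighbors in graph.items()
    }

def textrank(graph, d=DAMPING, max_iter=100, tol=1e-6):
    """Iterate from all-ones until the largest score change is below tol."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = textrank_step(graph, scores, d)
        if max(abs(new_scores[n] - scores[n]) for n in graph) < tol:
            return new_scores
        scores = new_scores
    return scores

graph = build_graph(EDGES)
first = textrank_step(graph, {node: 1.0 for node in graph})
final = textrank(graph)
for word, score in sorted(final.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.2f}")
```

Because every node passes on its full score, the sum of all scores stays equal to the number of nodes across iterations.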
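Approach 2 - TextRank: first iteration
Nodes: geico, auto, insurance, policy, privacy, find, car, coverage, call, service
Every node starts with a score of 1.
Node degrees: 5, 5, 4, 2, 1, 1, 1, 1, 1, 1

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

Node with neighbors service, call, auto, insurance, policy:

$$S(V_i) = 0.15 + 0.85 \left( \tfrac{1}{1} \cdot 1 + \tfrac{1}{1} \cdot 1 + \tfrac{1}{2} \cdot 1 + \tfrac{1}{5} \cdot 1 + \tfrac{1}{4} \cdot 1 \right) = 2.65$$

Node with neighbors find, privacy, insurance, geico:

$$S(V_i) = 0.15 + 0.85 \left( \tfrac{1}{1} \cdot 1 + \tfrac{1}{1} \cdot 1 + \tfrac{1}{5} \cdot 1 + \tfrac{1}{5} \cdot 1 \right) = 2.19$$

Node whose only neighbor is geico:

$$S(V_i) = 0.15 + 0.85 \left( \tfrac{1}{5} \cdot 1 \right) = 0.32$$

Scores after the first iteration (in figure order): 0.32, 0.32, 2.65, 0.49, 2.65, 2.19, 0.36, 0.32, 0.32, 0.36
d is the damping factor, usually set to 0.85.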
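Approach 2 - TextRank: convergence
Nodes: geico, auto, insurance, policy, privacy, find, car, coverage, call, service
Node degrees: 5, 5, 4, 2, 1, 1, 1, 1, 1, 1
Node scores as shown in the two graph panels (in figure order):
0.51, 0.51, 2.12, 0.87, 2.12, 1.77, 0.52, 0.51, 0.51, 0.52
0.51, 0.51, 2.12, 0.87, 2.65, 1.75, 0.52, 0.51, 0.51, 0.52

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

Node with neighbors find, privacy, insurance, geico:

$$S(V_i) = 0.15 + 0.85 \left( \tfrac{1}{1} \cdot 0.52 + \tfrac{1}{1} \cdot 0.52 + \tfrac{1}{5} \cdot 2.12 + \tfrac{1}{5} \cdot 2.12 \right) = 1.75$$

Node whose only neighbor is geico:

$$S(V_i) = 0.15 + 0.85 \left( \tfrac{1}{5} \cdot 2.12 \right) = 0.51$$

Node with five neighbors (out-degrees 1, 1, 2, 5, 4):

$$S(V_i) = 0.15 + 0.85 \left( \tfrac{1}{1} \cdot 0.51 + \tfrac{1}{1} \cdot 0.51 + \tfrac{1}{2} \cdot 0.87 + \tfrac{1}{5} \cdot 2.12 + \tfrac{1}{4} \cdot 1.77 \right) = 2.12$$

The scores shown are after 10 iterations. TextRank converges really quickly (<= 20 iterations).
d is the damping factor, usually set to 0.85.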
Approach 3 - Word2vec + ?
• Preprocessing
  • No preprocessing (ideally)
• Keyword extraction models
  • Word2Vec + Clustering

[Figure: Continuous Bag-of-Words model + negative sampling. The context words "The", "cat", "on", "that" enter as one-hot vectors W(1), W(2), ..., W(t-1), W(t) (labeled 5V*1), pass through the D*V projection matrix W into a D*1 vector, and a softmax predicts the center word "sits".]

Softmax output over the candidate words (sits, cover, sample, input, predict, learn, believe, type, five, design, human): 0.366, 0.2, 0.103, 0.100, 0.009, 0.011, 0.045, 0.050, 0.070, 0.010, 0.009

Cost function (softmax log-probability of the center word given its context, where u_j is the output score of word j):

$$\log p(w_t \mid w_{t-c}, \dots, w_{t+c}) = \log \frac{\exp(u_{w_t})}{\sum_{j=1}^{V} \exp(u_j)}$$

Backpropagation and gradient descent: [update equations for the projection matrix W; not legible in the source]
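The forward pass of a Continuous Bag-of-Words model can be sketched as below. This is a toy illustration under stated assumptions, not the trained model from the deck: the vocabulary, the embedding size, and the random weights are made up, the context vectors are averaged (summing is another common choice), and a full softmax is used for clarity instead of negative sampling.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and embedding size D.
VOCAB = ["the", "cat", "sits", "on", "that", "cover", "sample", "input"]
DIM = 4
word_to_id = {w: i for i, w in enumerate(VOCAB)}

# Input projection matrix W and output matrix W_out, both V x D and randomly
# initialised here; a trained model would have learned these values.
W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context):
    """Average the context embeddings, then score every vocabulary word."""
    rows = [W[word_to_id[w]] for w in context]  # one-hot lookup selects a row
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    scores = [sum(h * o for h, o in zip(hidden, out_row)) for out_row in W_out]
    return softmax(scores)

probs = cbow_forward(["the", "cat", "on", "that"])
loss = -math.log(probs[word_to_id["sits"]])  # negative log-likelihood of "sits"
print(f"p(sits | context) = {probs[word_to_id['sits']]:.3f}, loss = {loss:.3f}")
```

Training would adjust `W` and `W_out` to lower this loss over many (context, center word) pairs; the rows of `W` are the word vectors that are kept afterwards.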
Approach 3 - Word2vec + Clustering
• k-means
• DBSCAN
Approach 3 - Word2vec + TextRank
[Figure: The document text (john, deere, compact, utility, tractor, taylor, messick, inc, ..., company, profile, agricultural, equipment) is passed through the trained Word2vec model (word vectors W(1), W(2), ..., W(n); an N*D matrix), which yields related words (tractor, tillage, mower, excavator, sprayer, shredder, agriculture, harvest). These words are then ranked with TextRank.]
• Identify semantically important keywords
Approach 4 - TextRank + Word2vec

TextRank result:
Word        TextRank score
tractor     0.015847
john        0.013281
sale        0.012494
standard    0.012474
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
work        0.007907
series      0.007707
mower       0.006099
utility     0.006035
compact     0.005751

Word2vec similarity (to "tractor"):
Word        Similarity
mower       0.8502
excavator   0.7708
shredder    0.7451
tillage     0.7341
harvest     0.7154
sprayer     0.7101

New scores (TextRank score of "tractor" * Word2vec similarity):
Word        New score
tractor     0.015847
mower       0.015847 * 0.8502 = 0.013473
john        0.013281
sale        0.012494
standard    0.012474
excavator   0.015847 * 0.7708 = 0.012215
shredder    0.015847 * 0.7451 = 0.011808
tillage     0.015847 * 0.7341 = 0.011633
harvest     0.015847 * 0.7154 = 0.011337
sprayer     0.015847 * 0.7101 = 0.011253
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
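The rescoring step can be sketched as follows, using the TextRank scores and Word2vec similarities listed on this slide. The function name `propagate` is hypothetical, and keeping the larger of a word's own score and its propagated score is an assumption about how overlaps such as "mower" are resolved.

```python
# TextRank scores and Word2vec similarities to "tractor", copied from the slide.
textrank_scores = {
    "tractor": 0.015847, "john": 0.013281, "sale": 0.012494,
    "standard": 0.012474, "equipment": 0.010799, "power": 0.009747,
    "messick": 0.008162, "new": 0.008151, "work": 0.007907,
    "series": 0.007707, "mower": 0.006099, "utility": 0.006035,
    "compact": 0.005751,
}
similar_to_tractor = {
    "mower": 0.8502, "excavator": 0.7708, "shredder": 0.7451,
    "tillage": 0.7341, "harvest": 0.7154, "sprayer": 0.7101,
}

def propagate(scores, seed, neighbors):
    """Score each word similar to `seed` as seed score * similarity.

    Assumption: a word that already has a score keeps the larger value.
    """
    merged = dict(scores)
    for word, similarity in neighbors.items():
        merged[word] = max(merged.get(word, 0.0), scores[seed] * similarity)
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranking = propagate(textrank_scores, "tractor", similar_to_tractor)
for word, score in ranking[:10]:
    print(f"{word:10s} {score:.6f}")
```

Words that never occur in the document, such as "excavator", can enter the ranking this way because they are semantically close to a high-scoring keyword.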
Google's Pre-trained Word2vec
Campaign                                  Fraction of words in           Fraction of keywords in
                                          pre-trained model vocab.       pre-trained model vocab.
Geico                                     0.929985                       0.88888
Taylor Messick (Agricultural Equipment)   0.929784                       0.41176
Trane (AC)                                0.922018                       0.71428
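A coverage figure of this kind could be computed as sketched below. The keyword list and vocabulary are hypothetical placeholders, and counting per token rather than per unique word is an assumption; the deck does not say which was used.

```python
def vocab_coverage(tokens, vocab):
    """Fraction of tokens that are present in the model vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in vocab) / len(tokens)

# Hypothetical keywords and vocabulary, for illustration only.
keywords = ["tractor", "mower", "messick", "tillage", "deere"]
pretrained_vocab = {"tractor", "mower", "tillage", "company"}

coverage = vocab_coverage(keywords, pretrained_vocab)
print(f"{coverage:.5f}")
```

Low keyword coverage, as for the agricultural equipment campaign, means many extracted keywords have no vector in the pre-trained model and cannot take part in the similarity step.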
Model Testing
1. Generate keywords from the 4 models
2. Feed them into Lucene and find URLs
3. Track the audience who visited these URLs
4. Compare the audience we find to the audience the pixels find
Results (Dell) - Keyword
TFIDF TextRank Word2vec_Textrank TextRank_Word2vec
office dell outlet dell
dellcom support collaboration acquire
view service acquire laptop
electronics product work desktop
customer price purchase software
dell use spare rebate
representative software poster welding
dellcomreturnspolicy customer transformation windows
dells system apg corporations
information practices new dell please dell software
prosupport dell dell inc poster laptop desktop
products view dell outlet apg transformation dell new
services support dell dell today purchase acquire dell tablet
dell sales dell team spare transformation dell inc
Results (Toyota) - Keyword
TFIDF TextRank Word2vec_Textrank TextRank_Word2vec
highlander toyota generate toyota
kbbcom information acquire preowned
edmundscom site misuse certified
certify vehicle tale highlander
information use govern rav
certification program tradein yaris
site email fourwheel avalon
program service generate tale corolla
assistance sale rubbed bologna sequoia
violated please toyota site identify tundra
hybrid highlander toyota vehicle wheel camry
car certification toyota dealer rubbed tale venza
personal information new toyota help toyota vehicle
cruiser preowned toyota certified new avalon preowned
Results - URLs
Dell:
http://thetechjournal.com/electronics/laptop/dell-inspiron-15r-laptop.xhtml
http://www.adverts.ie/laptop-parts-and-accessories/dell-laptop-charger-19-5v-4-62a-90w/10838435
http://www.dellservicecentreinchennai.in/tablet-repair-center-medavakkam.html
http://www.dell.com/us/business/p/poweredge-c6320p/pd?oc=&model_id=poweredge-c6320p&l=en&s=bsd
http://forum.notebookreview.com/threads/dell-2012-outlet-coupons.636641/page-21
Toyota:
http://www.macdonaldtoyota.ca/
http://www.stcharlestoyota.net
http://www.baldwintoyotaofpoplarbluff.com/
http://www.lafontainetoyota.com/
http://www.cedarrapidstoyota.com/
http://www.craigtoyota.com/
http://www.planettoyotaonline.com/
http://www.gatewaytoyotapierre.com/
Result - # of Converters (% of Converters)
CampaignId   TFIDF        TextRank     TextRank_Word2vec   Word2vec_TextRank
13405        25 (0.2%)    99 (0.8%)    44 (0.4%)           1 (0.008%)
13553        229 (3.2%)   269 (3.7%)   252 (3.5%)          8 (0.1%)
14099        6 (0.03%)    57 (0.3%)    16 (0.08%)          2 (0.01%)
14545        247 (3%)     250 (3%)     482 (5.7%)          7 (0.08%)
15077        0 (0%)       4 (0.02%)    15 (0.08%)          6 (0.03%)
Conclusion
• TextRank and TextRank_Word2vec consistently perform better than TFIDF
• TextRank doesn't require extra space for saving a model
• All 3 models need O(n) computational time
Automatic Search Event-Summary
Appendix
[Figure: Neural Net Language Model. The context words "The", "cat", "sits", "on", "that" enter as one-hot vectors W(1), W(2), ..., W(t-1), W(t) (5V*1), pass through the D*V projection matrix into a 5D*1 vector, then through a tanh hidden layer, and finally a softmax output layer over the vocabulary (Apple, ..., Computer, point, traffic, inbox, policy, print, couch, choice, ..., choose, later, media). Most of the computation is in the softmax output layer.]

Maximize:

$$\frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$$

Time complexity:

$$N \times D + N \times D \times H + H \times V$$

[Figure: Hierarchical Probabilistic Neural Net Language Model. Same input ("The", "cat", "sits", "on", "that"), D*V projection matrix, 5D*1 projected vector and tanh hidden layer; the output words (TV, Computer, couch, table, make, choose, print, write) are organized in a hierarchy.]

[Figure: Continuous Bag-of-Words Model. The context words "The", "cat", "on", "that" enter as one-hot vectors (5V*1), pass through the D*V projection matrix into a single D*1 vector, which predicts the center word among the candidates (TV, Computer, couch, table, make, choose, sits, crawl).]


Editor's Notes

  • #6, #7: Gensim's TF-IDF model normalizes by document length. Sklearn's TF-IDF model handles normalization and adds a pseudocount: it uses log + 1 instead of log so that terms with zero IDF are not suppressed entirely (idf = np.log(float(n_samples) / df) + 1.0). Sklearn uses the natural log, while Gensim's TF-IDF uses log2.
  • #10: 250 words, 250 vertices
  • #13: Both k-means and DBSCAN need arbitrary parameters that are hard to determine and have to be tuned. K-means doesn't cluster well with the model we trained. DBSCAN clusters better, but it either throws out a lot of keywords as noise or clusters everything into one big group, depending on the parameters. Neither solves the generalization problem.
  • #15: Identify keywords that are either not in the document, or structurally less important in the document but semantically close to the more important keywords. Integrate the structural importance with the semantic importance.