The document describes different approaches to automatically extracting keywords from text for search event purposes. It discusses TF-IDF, TextRank, combining Word2Vec with TextRank, and training a Word2Vec model on pre-trained word embeddings to extract keywords. Initial results found TextRank and the combination of TextRank and Word2Vec performed better than TF-IDF at finding related pages and audiences.
6. Approach 1 - TFIDF
Preprocessing
Lowercase, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags
Keyword Extraction models
TF-IDF
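TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)
TF(w, d) = # times word w occurred in doc d
IDF(n, N) = ln(N / n), where n = # docs the word w appears in and N = total # docs
(the example values in the table are consistent with the natural log and N = 3)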
Word        Term freq in doc1   Appears in # docs   TF-IDF
car         27                  3                   0
auto        3                   2                   1.216
insurance   0                   2                   0
best        14                  2                   5.676
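The table values can be checked with a few lines of Python. This is a minimal sketch, assuming the IDF is ln(N / n) with N = 3 documents (both inferred from the table, not stated on the slide); the function name `tf_idf` and the constant `TOTAL_DOCS` are hypothetical.

```python
import math

TOTAL_DOCS = 3  # assumption: inferred from the table values, not given on the slide


def tf_idf(term_freq, docs_with_term, total_docs=TOTAL_DOCS):
    """Return TF * ln(N / n); 0.0 when the term does not occur in the document."""
    if term_freq == 0 or docs_with_term == 0:
        return 0.0
    return term_freq * math.log(total_docs / docs_with_term)


# (term frequency in doc1, number of docs containing the word), copied from the table
rows = {"car": (27, 3), "auto": (3, 2), "insurance": (0, 2), "best": (14, 2)}

scores = {word: tf_idf(tf, n) for word, (tf, n) in rows.items()}
for word, score in scores.items():
    print(f"{word:10s} {score:.4f}")
```

A word that appears in every document ("car") gets an IDF of 0 and therefore a score of 0 no matter how often it occurs, which is the weakness of plain TF-IDF that the later speaker note on log+1 addresses.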
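7. Approach 2 - TextRank
Preprocessing
Lowercase, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags
Identify structurally important keywords
Iteratively calculate:
$$p(v_i) = (1 - d) + d \sum_{v_j \in N(v_i)} \frac{1}{|N(v_j)|}\, p(v_j)$$
where N(v) is the set of words linked to v in the word graph, and d is the damping factor, usually set to 0.85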
8. Approach 2 - TextRank
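[Figure: word graph with nodes geico, auto, insurance, policy, privacy, find, car, coverage, call, service. Every node starts with score 1. Node degrees shown: 5, 5, 4, 2, 1, 1, 1, 1, 1, 1. Scores shown after the first iteration: 2.65, 2.65, 2.19, 0.49, 0.36, 0.36, 0.32, 0.32, 0.32, 0.32.]
First iteration, worked examples (d = 0.85, so 1 - d = 0.15):
Word linked to service, call, auto, insurance, policy (which have 1, 1, 2, 5, 4 links):
$$p = 0.15 + 0.85 \left( \frac{1}{1} \cdot 1 + \frac{1}{1} \cdot 1 + \frac{1}{2} \cdot 1 + \frac{1}{5} \cdot 1 + \frac{1}{4} \cdot 1 \right) \approx 2.65$$
Word linked to find, privacy, insurance, geico (which have 1, 1, 5, 5 links):
$$p = 0.15 + 0.85 \left( \frac{1}{1} \cdot 1 + \frac{1}{1} \cdot 1 + \frac{1}{5} \cdot 1 + \frac{1}{5} \cdot 1 \right) = 2.19$$
Word linked only to geico (which has 5 links):
$$p = 0.15 + 0.85 \left( \frac{1}{5} \cdot 1 \right) = 0.32$$
The update is repeated over further iterations.
d is the damping factor, usually set to 0.85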
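The first iteration can be reproduced with a short Python sketch. The edge list below is an assumption: it is inferred from the neighbor labels and link counts in the worked examples (for instance, that car and coverage attach to insurance), not read off the original figure. The function names are hypothetical as well.

```python
DAMPING = 0.85

# Assumed undirected word graph, inferred from the worked examples on the slide
EDGES = [
    ("geico", "service"), ("geico", "call"), ("geico", "auto"),
    ("geico", "insurance"), ("geico", "policy"),
    ("insurance", "auto"), ("insurance", "policy"),
    ("insurance", "car"), ("insurance", "coverage"),
    ("policy", "find"), ("policy", "privacy"),
]


def build_neighbors(edges):
    """Map each word to the set of words it is linked to."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def textrank_step(scores, neighbors, d=DAMPING):
    """One update: p(v) = (1 - d) + d * sum of p(u) / degree(u) over neighbors u."""
    return {
        node: (1 - d) + d * sum(scores[u] / len(neighbors[u]) for u in linked)
        for node, linked in neighbors.items()
    }


def textrank(neighbors, d=DAMPING, iterations=30):
    """Start every word at 1.0 and apply the update a fixed number of times."""
    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        scores = textrank_step(scores, neighbors, d)
    return scores


neighbors = build_neighbors(EDGES)
first = textrank_step({node: 1.0 for node in neighbors}, neighbors)
for word, score in sorted(first.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {score:.4f}")
```

Under this assumed graph, the first-iteration values match the ten scores shown in the figure. A fixed iteration count is used here for simplicity; a convergence threshold on the score change would be the more usual stopping rule.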
13. Approach 3 - Word2vec + TextRank
[Figure: three stages labeled Document Text, Trained Word2vec Model, and TextRank. Document text: john, deere, compact, utility, tractor, taylor, messick, inc, ..., company, profile, agricultural, equipment. Trained Word2vec model: word vectors W(1), W(2), ..., W(n) in an N*D matrix, with related words tractor, tillage, mower, excavator, sprayer, shredder, agriculture, harvest. TextRank output: mower, excavator, shredder, tillage, harvest, sprayer.]
Identify semantically important keywords
14. Approach 4 - TextRank + Word2vec
TextRank result
Word        TextRank score
tractor     0.015847
john        0.013281
sale        0.012494
standard    0.012474
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
work        0.007907
series      0.007707
mower       0.006099
utility     0.006035
compact     0.005751
Word2vec similarity (to the top-ranked word, tractor)
Word        Similarity
mower       0.8502
excavator   0.7708
shredder    0.7451
tillage     0.7341
harvest     0.7154
sprayer     0.7101
New score (TextRank score of tractor * Word2vec similarity for the added words)
Word        New score
tractor     0.015847
mower       0.015847 * 0.8502 = 0.013473
john        0.013281
sale        0.012494
standard    0.012474
excavator   0.015847 * 0.7708 = 0.012215
shredder    0.015847 * 0.7451 = 0.011808
tillage     0.015847 * 0.7341 = 0.011633
harvest     0.015847 * 0.7154 = 0.011337
sprayer     0.015847 * 0.7101 = 0.011253
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
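The rescoring step on this slide can be sketched in Python as follows. The scores and similarities are copied from the slide; the function name `expand_with_similar` is hypothetical, and keeping the larger of a word's own TextRank score and its derived score is an assumption made here to handle a word like mower, which appears in both lists.

```python
TEXTRANK_SCORES = {
    "tractor": 0.015847, "john": 0.013281, "sale": 0.012494,
    "standard": 0.012474, "equipment": 0.010799, "power": 0.009747,
    "messick": 0.008162, "new": 0.008151, "work": 0.007907,
    "series": 0.007707, "mower": 0.006099, "utility": 0.006035,
    "compact": 0.005751,
}

# Word2vec similarity of each candidate to the seed word "tractor" (from the slide)
SIMILAR_TO_TRACTOR = {
    "mower": 0.8502, "excavator": 0.7708, "shredder": 0.7451,
    "tillage": 0.7341, "harvest": 0.7154, "sprayer": 0.7101,
}


def expand_with_similar(textrank_scores, seed, similar):
    """Score each similar word as seed score * similarity, then rank all words."""
    seed_score = textrank_scores[seed]
    combined = dict(textrank_scores)
    for word, similarity in similar.items():
        derived = seed_score * similarity
        # Assumption: keep the larger of the original and the derived score
        combined[word] = max(combined.get(word, 0.0), derived)
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


ranked = expand_with_similar(TEXTRANK_SCORES, "tractor", SIMILAR_TO_TRACTOR)
for word, score in ranked[:14]:
    print(f"{word:10s} {score:.6f}")
```

Words that never occur in the document (excavator, shredder, and so on) enter the ranking this way, which is how the approach surfaces keywords that are semantically close to a structurally important word.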
15. Google's Pre-trained Word2vec
Share of each campaign's words and keywords found in the pre-trained model vocabulary (as fractions)
Campaign                                 Words       Keywords
Geico                                    0.929985    0.88888
Taylor Messick (Agricultural Equipment)  0.929784    0.41176
Trane (AC)                               0.922018    0.71428
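A coverage figure of this kind can be computed as the fraction of words that are present in the model vocabulary. The sketch below is illustrative only: the function name, the toy word list, and the toy vocabulary are hypothetical, and the slide does not say whether coverage was counted over all tokens or over unique words (tokens are counted here).

```python
def vocab_coverage(words, vocab):
    """Fraction of the given words that appear in the model vocabulary."""
    if not words:
        return 0.0
    return sum(1 for word in words if word in vocab) / len(words)


# Toy data for illustration; not taken from the campaigns on the slide
toy_vocab = {"tractor", "mower", "equipment", "insurance"}
toy_keywords = ["tractor", "messick", "mower", "deere"]

coverage = vocab_coverage(toy_keywords, toy_vocab)
print(f"{coverage:.2f}")
```

Low keyword coverage, as in the agricultural equipment campaign, means many domain-specific keywords have no vector in the pre-trained model and cannot be expanded by similarity.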
16. Model Testing
1. Generate keywords from the 4 models
2. Feed them into Lucene and find URLs
3. Track the audience who visited these URLs
4. Compare the audience we find to the audience the pixels find
17. Results (Dell) - Keyword
TFIDF                 TextRank   Word2vec_Textrank  TextRank_Word2vec
office                dell       outlet             dell
dellcom               support    collaboration      acquire
view                  service    acquire            laptop
electronics           product    work               desktop
customer              price      purchase           software
dell                  use        spare              rebate
representative        software   poster             welding
dellcomreturnspolicy  customer   transformation     windows
dells                 system     apg                corporations
information practices new dell please dell software
prosupport dell dell inc poster laptop desktop
products view dell outlet apg transformation dell new
services support dell dell today purchase acquire dell tablet
dell sales dell team spare transformation dell inc
18. Results (Toyota) - Keyword
TFIDF          TextRank     Word2vec_Textrank  TextRank_Word2vec
highlander     toyota       generate           toyota
kbbcom         information  acquire            preowned
edmundscom     site         misuse             certified
certify        vehicle      tale               highlander
information    use          govern             rav
certification  program      tradein            yaris
site           email        fourwheel          avalon
program service generate tale corolla
assistance sale rubbed bologna sequoia
violated please toyota site identify tundra
hybrid highlander toyota vehicle wheel camry
car certification toyota dealer rubbed tale venza
personal information new toyota help toyota vehicle
cruiser preowned toyota certified new avalon preowned
22. Conclusion
TextRank and TextRank_Word2vec consistently perform better than TFIDF
TextRank doesn't require extra space for saving a model
All 3 models need O(n) computational time
#6: Gensim's tfidf model takes care of normalization by document length
Sklearn's tfidf model takes care of normalization and pseudocount
# log+1 instead of log makes sure terms with zero idf don't get suppressed entirely.
# idf = np.log(float(n_samples) / df) + 1.0
Sklearn uses natural log, while gensim's tfidf uses log2
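The difference between the two IDF variants named in this note can be seen on a word that occurs in every document. This is a sketch of the formulas only, written with the standard library rather than by calling either library; the function names are hypothetical.

```python
import math


def idf_natural_log_plus_one(n_docs, doc_freq):
    """IDF as quoted in the note: ln(n_docs / doc_freq) + 1.0."""
    return math.log(n_docs / doc_freq) + 1.0


def idf_log2(n_docs, doc_freq):
    """Base-2 IDF with no added constant: log2(n_docs / doc_freq)."""
    return math.log2(n_docs / doc_freq)


# A word that appears in all 3 documents, like "car" on slide 6
plus_one = idf_natural_log_plus_one(3, 3)
base_two = idf_log2(3, 3)
print(plus_one, base_two)
```

With the added constant the word keeps a weight proportional to its term frequency; without it the weight drops to zero regardless of how often the word occurs.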
#13: Both need arbitrary parameters to be set, which are hard to determine and have to be tuned
K-means doesn't cluster well with the model we trained
DBSCAN clusters better, but either throws out a lot of keywords as noise or clusters everything into 1 big group, depending on the parameters set
Still did not solve the problem with generalization
#15: Identify keywords that are either not in the document, or structurally less important in the document but semantically close to the more important keyword
Integrate the structural importance with the semantic importance