DATA MINING AND STATISTICAL ANALYSIS SOLUTIONS
Skills demand analysis based on the data from
online HR websites: Using web scraping and text
mining applications: IT Sector
Habet Madoyan
Vahe Movsisyan
Sunday, July 03, 2016
The analysis is funded by the research grant from American University of Armenia.
Presented at:
IX International School-Seminar. Town of Tsakhkadzor, Republic of Armenia
Methodology:
Overview
Datamotus LLC 2
Introduction
In recent years, online job ads have become a popular job-search channel, so
the research community is increasingly using detailed breakdowns of online
job ads to study labor market dynamics.
It is estimated that in the USA 60-70 percent of job openings are now posted
on the Internet. However, these job ads are biased toward industries and
occupations that seek high-skilled, “white-collar” workers.
Introduction
Job seekers, employers, students, researchers, policymakers, higher education
institutions, career advisors, and curriculum developers now view online job ads
data as a practical source for exploring today’s dynamic labor market.
Online job ads can show the relative demand for different types of skills and levels
of education. The real-time nature of job ads data also allows for the early
detection of labor demand trends, which gives job seekers, employers, and
policymakers a forward-looking analytical tool.
Real-time labor market indicators can be particularly useful in aligning education
and training curricula with workforce needs in emerging or rapidly changing
industries, such as healthcare and information technology.
Job ads provide an incomplete picture of labor demand
Online job ads data strongly correlate with job openings data
IT Skills Analysis
Web Scraping
Text Mining
Synopsis of the study
• Develop an algorithm for web scraping job announcement data (careercenter.am)
• Text mining and parsing algorithms to structure job announcements
• Algorithms to assess and track vacancy rates by:
    • Industry
    • Job role
    • Specific skills
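The link-collection part of a scraper can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the authors' algorithm: the sample HTML, the `/jobs/` URL marker, the `example.org` base URL, and the names `JobLinkParser` and `extract_job_links` are all hypothetical and do not describe the actual markup of careercenter.am.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class JobLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like job-post links."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical substring identifying job-post URLs
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            # Resolve relative links against the listing page's base URL
            self.links.append(urljoin(self.base_url, href))


def extract_job_links(html, base_url, marker="/jobs/"):
    """Return unique job-post URLs in the order they first appear."""
    parser = JobLinkParser(base_url, marker)
    parser.feed(html)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(parser.links))


# Hypothetical listing page; a real run would download the HTML first.
SAMPLE_LISTING = """
<ul>
  <li><a href="/jobs/101">Java Software Developer</a></li>
  <li><a href="/jobs/102">QA Engineer</a></li>
  <li><a href="/about">About us</a></li>
  <li><a href="/jobs/101">Java Software Developer</a></li>
</ul>
"""

job_links = extract_job_links(SAMPLE_LISTING, "https://example.org")
print(job_links)
```

Keeping the parser separate from the download step makes it possible to test the extraction logic on saved pages without touching the network.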
What was done
• Around 20,000 posts were scraped from the web.
• Posts arrive in a rough, unstructured form; an algorithm was developed to structure them.
A variable for each “section”
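One way to turn a raw post into one variable per section is to split the text on known section labels. The sketch below assumes each section starts on its own line as `LABEL:`; the label list, the sample post, and the function name `split_into_sections` are illustrative assumptions, not the labels or code used in the study.

```python
import re

# Hypothetical section labels; real posts may use different ones.
SECTION_LABELS = [
    "TITLE",
    "JOB DESCRIPTION",
    "REQUIRED QUALIFICATIONS",
    "APPLICATION DEADLINE",
]


def split_into_sections(post_text, labels=SECTION_LABELS):
    """Return {label: text} for every 'LABEL:' that starts a line in the post."""
    alternatives = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(rf"^({alternatives}):\s*", flags=re.MULTILINE)
    matches = list(pattern.finditer(post_text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next label, or to the end of the post
        end = matches[i + 1].start() if i + 1 < len(matches) else len(post_text)
        body = post_text[match.end():end]
        # Collapse line breaks and repeated spaces inside the section
        sections[match.group(1)] = " ".join(body.split())
    return sections


SAMPLE_POST = """TITLE: Java Software Developer
JOB DESCRIPTION: Develop and maintain
server-side applications.
REQUIRED QUALIFICATIONS: Java, SQL,
JavaScript
APPLICATION DEADLINE: 30 June 2016
"""

record = split_into_sections(SAMPLE_POST)
for label, text in record.items():
    print(f"{label}: {text}")
```

Each resulting dictionary can then become one row of a table, with one column per section label.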
Total vacancy rate (Careercenter) and Official Labor Demand (2004-2016 I Quarter)
[Figure: quarterly time series, 2004Q1 to 2016Q1. Series: Total jobs (Careercenter); Job Demand (NSS, right scale).]
Correlation = 0.76
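The slide does not say which correlation coefficient is reported; assuming it is the Pearson coefficient between the two quarterly series, it could be computed as in the sketch below. The two short series are made-up numbers for illustration only, not the Careercenter or NSS data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up quarterly counts, for illustration only
scraped_posts = [120, 150, 170, 160, 210, 260]
official_demand = [900, 1000, 1300, 1100, 1500, 1700]

r = pearson(scraped_posts, official_demand)
print(f"correlation = {r:.2f}")
```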
Job Market Overview: IT sector
ICT sector and overall economy
[Figure: annual series, 2005 to 2015. Series: Average yearly wage in Transport and Communication sector / Average yearly wage in RA; Weight of Transport and Communication sector (including IT sector) in GDP (right scale, in %).]
Total vacancy and IT sector vacancy rates (Careercenter, 2004-2016)
[Figure: quarterly time series, 2004Q1 to 2016Q1. Series: Non IT Jobs (Careercenter); IT Jobs (Careercenter, right scale).]
Correlation = 0.81
Hard Skills in IT Sector
Time series: Annual demand for top 5 programming languages
[Figure: annual time series, 2004 to 2015. Series: C++, Javascript, Java, C#, PHP.]
Time series: Annual demand for top 5 programming languages (parabolic trend)
[Figure: polynomial trend lines, 2004 to 2015. Series: Poly. (C++), Poly. (Javascript), Poly. (Java), Poly. (C#), Poly. (PHP).]
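A parabolic trend is a least-squares fit of a second-degree polynomial to the yearly counts. The sketch below solves the normal equations directly; it assumes the trend lines are ordinary degree-2 fits (the slide does not give the fitting details), and the yearly counts are invented for illustration.

```python
def fit_quadratic(ts, ys):
    """Least-squares fit of y = a + b*t + c*t**2; returns (a, b, c)."""
    # Normal equations (X^T X) beta = X^T y, where each row of X is [1, t, t^2]
    power_sums = [sum(t ** k for t in ts) for k in range(5)]
    rhs = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    m = [[power_sums[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("need at least three distinct t values")
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [u - factor * v for u, v in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


years = list(range(2004, 2016))
# Made-up yearly counts of posts mentioning one language
counts = [12, 18, 25, 40, 38, 35, 60, 85, 110, 150, 170, 210]

# Shift years to t = 0, 1, 2, ... to keep the sums of powers small
ts = [year - years[0] for year in years]
a, b, c = fit_quadratic(ts, counts)
trend = [a + b * t + c * t * t for t in ts]
print([round(value, 1) for value in trend])
```

Shifting the years to start at zero before fitting avoids the large powers of four-digit years, which would make the normal equations poorly conditioned.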
Analyzing demand for programming languages using association rules
Arules
• Association rule mining is used to analyse the co-occurrence of programming languages in a job post
• The R packages “arules” and “arulesViz” are used for the analysis
• The analysis is done for IT jobs only
Association rules: Measures of rules interestingness
Suppose we have the rule: IF {A} => {B}
Measure 1
$$\text{Support} = P(A \cap B)$$
Measure 2
$$\text{Confidence} = P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
Measure 3
$$\text{Lift} = \frac{P(B \mid A)}{P(B)} = \frac{P(A \cap B)}{P(A)} \cdot \frac{1}{P(B)}$$
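The three measures can be checked by hand on a small example. The sketch below estimates each probability as a relative frequency over a list of job posts, where each post is the set of languages it mentions; the five posts and the function name `rule_measures` are made up for illustration, and the study itself used the R package arules rather than this code.

```python
def rule_measures(transactions, lhs, rhs):
    """Support, confidence and lift of the rule lhs => rhs.

    Probabilities are estimated as relative frequencies over the transactions.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(transactions)
    p_a = sum(lhs <= t for t in transactions) / n           # P(A)
    p_b = sum(rhs <= t for t in transactions) / n           # P(B)
    p_ab = sum((lhs | rhs) <= t for t in transactions) / n  # P(A and B)
    if p_a == 0 or p_b == 0:
        raise ValueError("both sides of the rule must occur at least once")
    support = p_ab
    confidence = p_ab / p_a
    lift = confidence / p_b
    return support, confidence, lift


# Made-up job posts, each reduced to the set of languages it mentions
posts = [
    {"java", "javascript"},
    {"java", "javascript", "php"},
    {"java"},
    {"php", "javascript"},
    {"cplusplus"},
]

support, confidence, lift = rule_measures(posts, {"java"}, {"javascript"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two languages appear together more often than they would if they were mentioned independently.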
Visualizing the rules
Association Mining for Programming languages: C++
• A set of association rules is generated for the top 20 programming languages.
• Rules are subsetted with a minimum support of 0.01 and a minimum confidence of 0.1.
[Figure panels: rules with two items on the left; rules with one item on the left.]
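Filtering rules by minimum support and confidence, with one or two items on the left-hand side, can be illustrated by brute-force enumeration. This is a sketch only: it is not the Apriori algorithm and not the arules implementation used in the study, the posts are made up, and `mine_rules` is a hypothetical name. The default thresholds mirror the ones stated on the slide.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.01, min_confidence=0.1, max_lhs=2):
    """Enumerate rules lhs => rhs with up to max_lhs items on the left.

    Returns (lhs_tuple, rhs_item, support, confidence) for every rule that
    meets both thresholds. Brute force, suitable only for small item sets.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def freq(itemset):
        # Share of transactions containing every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(items, size):
            lhs_set = frozenset(lhs)
            p_lhs = freq(lhs_set)
            if p_lhs == 0:
                continue  # left-hand side never occurs, rule is undefined
            for rhs in items:
                if rhs in lhs_set:
                    continue
                support = freq(lhs_set | {rhs})
                confidence = support / p_lhs
                if support >= min_support and confidence >= min_confidence:
                    rules.append((lhs, rhs, support, confidence))
    return rules


# Made-up job posts, each reduced to the set of languages it mentions
posts = [
    {"java", "javascript"},
    {"java", "javascript", "php"},
    {"java"},
    {"php", "javascript"},
    {"cplusplus"},
]

rules = mine_rules(posts)
# Keep only rules that predict one language, strongest first
javascript_rules = sorted(
    (rule for rule in rules if rule[1] == "javascript"),
    key=lambda rule: rule[3],
    reverse=True,
)
for lhs, rhs, support, confidence in javascript_rules:
    print(f"{set(lhs)} => {{{rhs}}}  support={support:.2f}  confidence={confidence:.2f}")
```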
Association Mining for Programming languages: Java
Rules visualization: Java (all rules)
Rules visualization: Javascript
Job Title Analysis
IT Job Titles Frequency
Most popular Job Titles (2004Q1-2016Q1) Percentage
software developer/engineer 18.29%
quality assurance engineer 5.42%
java software developer 4.98%
system administrator 4.00%
web developer 3.66%
.net developer 2.94%
php developer 2.33%
graphic designer 1.89%
ios developer 1.31%
android developer 1.26%
deep submicron 0.98%
database developer 0.96%
support specialist 0.96%
database administrator 0.92%
technical support 0.89%
technical writer 0.83%
support engineer 0.80%
application developer 0.72%
design engineer 0.72%
r&d engineer 0.68%
team leader 0.67%
frontend developer 0.55%
monitoring evaluation 0.52%
information security 0.50%
senior r&d 0.50%
Total (25 titles listed) 57.29%
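A frequency table of this kind can be produced by normalizing the title strings and counting them. The sketch below assumes a simple normalization (lowercasing and whitespace collapsing); the study's actual title extraction is not described on the slide, and the sample titles and the name `title_shares` are made up.

```python
from collections import Counter


def title_shares(titles):
    """Percentage share of each normalized title, most frequent first."""
    # Lowercase and collapse whitespace so trivially different spellings merge
    normalized = [" ".join(t.lower().split()) for t in titles if t.strip()]
    counts = Counter(normalized)
    total = sum(counts.values())
    return [
        (title, round(100 * count / total, 2))
        for title, count in counts.most_common()
    ]


# Made-up raw titles, for illustration only
sample_titles = [
    "Software Developer",
    "software  developer",
    "QA Engineer",
    "Web Developer",
    "Software developer",
]

shares = title_shares(sample_titles)
for title, share in shares:
    print(f"{title}  {share:.2f}%")
```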
Software developer/engineer
[Figure: annual time series, 2004 to 2015.]
Quality assurance engineer
[Figure: annual time series, 2004 to 2015. Series: quality.assurance.engineer.]
Java software developer
[Figure: annual time series, 2004 to 2015. Series: java.software.developer.]
System administrator
[Figure: annual time series, 2004 to 2015. Series: system.administrator.]
Web developer
[Figure: annual time series, 2004 to 2015. Series: web.developer.]
IT Job Titles vs Programming languages
Job Title => Programming language                 Confidence
{software developer/engineer} => {csharp}         0.33
{software developer/engineer} => {java}           0.30
{software developer/engineer} => {javascript}     0.20
{software developer/engineer} => {asp}            0.20
{software developer/engineer} => {php}            0.12
{software developer/engineer} => {j}              0.12
{software developer/engineer} => {tcl}            0.09
{software developer/engineer} => {python}         0.07
{software developer/engineer} => {cplusplus}      0.06
{software developer/engineer} => {ruby}           0.03
{software developer/engineer} => {visual.basic}   0.02
{software developer/engineer} => {verilog}        0.02
{quality assurance engineer} => {java}            0.27
{quality assurance engineer} => {shell}           0.25
{quality assurance engineer} => {perl}            0.22
{quality assurance engineer} => {python}          0.14
{quality assurance engineer} => {tcl}             0.12
{quality assurance engineer} => {bash}            0.04
{quality assurance engineer} => {verilog}         0.04
{java software developer} => {java}               0.98
{java software developer} => {javascript}         0.47
{java software developer} => {j}                  0.39
{java software developer} => {shell}              0.11
{java software developer} => {ruby}               0.05
{system administrator} => {perl}                  0.09
{system administrator} => {shell}                 0.09
{system administrator} => {bash}                  0.03
{system administrator} => {pl.sql}                0.02
{web developer} => {javascript}                   0.76
{web developer} => {php}                          0.57
{web developer} => {asp}                          0.36
{web developer} => {csharp}                       0.27
{web developer} => {ruby}                         0.02
{.net developer} => {asp}                         0.82
{.net developer} => {csharp}                      0.80
{.net developer} => {javascript}                  0.42
{.net developer} => {visual.basic}                0.03
{php developer} => {php}                          1.00
{php developer} => {javascript}                   0.71
{php developer} => {ruby}                         0.08
{php developer} => {python}                       0.07
Next Steps:
• Develop a machine learning algorithm to classify job ads by sector.
• Develop state-of-the-art text mining and topic modeling algorithms to predict demand for skills, professions, and job roles.
• Create an interactive web dashboard (using R Shiny) to help:
    • Potential job seekers
    • Potential employees
    • Policy makers
    • Universities
Thank You For Your Attention!
