Harnessing Twitter to Support
Serendipitous Learning of Developers
Abhishek Sharma¹, Yuan Tian¹, Agus Sulistya¹, David Lo¹
and Aiko Fallas Yamashita²
¹ School of Information Systems, Singapore Management University
² Oslo and Akershus University, Norway
24th IEEE International Conference on Software Analysis,
Evolution, and Reengineering (SANER 2017)
Developer Challenges?
• Keeping up to date is a big challenge (Storey et al. TSE'16)
Why Twitter for Learning
• Keeping up to date is a big challenge (Storey et al. TSE'16)
• Software developers use Twitter to share important information (Tian et al. MSR'12)
• Twitter enables serendipitous (pleasant and undirected) learning for developers (Singer et al. ICSE'14)
Challenges
• Finding useful articles is not easy
• Developers need to (Singer et al. ICSE'14):
  ◦ identify many relevant Twitter users to follow
  ◦ sieve through a large number of tweets/URLs
• Too much information can make learning through Twitter an unpleasant experience
This Study
• Can we automatically extract popular and relevant URLs from Twitter for developers?
• In this work, we:
  ◦ propose 14 features to characterize a URL
  ◦ evaluate a supervised and an unsupervised approach to recommend URLs harvested from Twitter
Methodology (1): Collecting Seed Data
• Get a list of seed Twitter users (http://www.noop.nl/2009/02/twitter-top-100-for-softwaredevelopers.htm)
• Get a larger set of people who follow (or are followed by) >= 5 seed users
  ◦ Results in 85,171 Twitter users
• Collect the tweets generated by these users over a 1-month period (Nov'15)
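The seed-expansion step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_seed_users`, the edge-list input format, and the choice to count distinct seed users across both follow directions (and to exclude the seeds themselves from the result) are all assumptions.

```python
from collections import defaultdict


def expand_seed_users(follow_edges, seed_users, min_links=5):
    """Return non-seed users linked to at least `min_links` distinct seed users.

    follow_edges: iterable of (follower, followee) pairs (hypothetical format).
    A link counts in either direction: the user follows a seed user,
    or a seed user follows the user.
    """
    seeds = set(seed_users)
    linked_seeds = defaultdict(set)
    for follower, followee in follow_edges:
        if followee in seeds and follower not in seeds:
            linked_seeds[follower].add(followee)
        if follower in seeds and followee not in seeds:
            linked_seeds[followee].add(follower)
    return {user for user, linked in linked_seeds.items() if len(linked) >= min_links}
```

With `min_links=5` this mirrors the ">= 5 seed users" threshold on the slide; the caller can union the result with the seed list if seeds should be kept.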
Methodology (2): URL Extraction
• Find tweets which contain the keyword "java" (2,104 tweets)
• Find tweets which contain a URL (1,606 tweets)
• Extract URLs, e.g. http://ow.ly/UIxwS, http://bit.ly/1OFsZSj, http://goo.gl/IGxGlo, https://t.co/ryPI3
• Expand short URLs (770 expanded URLs)
• Resolve duplicate/broken URLs (577 URLs)
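The extraction steps above can be sketched roughly as below. The slides do not say how keyword matching, URL detection, or duplicate resolution were implemented, so the whole-word match, the regular expression, and the normalization rules here are assumptions. Short-URL expansion needs network access, so it is passed in as a caller-supplied `expand` function (hypothetical) that returns the final URL, or `None` for a broken link.

```python
import re
from urllib.parse import urlsplit, urlunsplit

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(tweets, keyword="java"):
    """Collect URLs from tweets that mention the keyword as a whole word."""
    keyword_pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    urls = []
    for text in tweets:
        if keyword_pattern.search(text):
            # Strip trailing punctuation that is usually not part of the URL.
            urls.extend(u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(text))
    return urls


def normalize(url):
    """Canonical form for duplicate detection (assumed rules:
    lowercase scheme and host, no trailing slash, no fragment)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), parts.query, ""))


def resolve_and_dedupe(short_urls, expand):
    """Expand each short URL, drop broken ones and duplicates, keep first-seen order."""
    seen, result = set(), []
    for short in short_urls:
        final = expand(short)
        if final is None:  # broken link
            continue
        key = normalize(final)
        if key not in seen:
            seen.add(key)
            result.append(final)
    return result
```

In practice `expand` could follow HTTP redirects with any HTTP client; a dictionary lookup is enough for offline experiments.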
Methodology (3): Feature Extraction
• 14 features extracted
  ◦ Content
  ◦ Popularity
  ◦ Network
Methodology (3): Feature Extraction
• Content: cosine similarity between the keyword and
  ◦ tweet text (CosSimT)
  ◦ user profile text (CosSimP)
  ◦ webpage text (CosSimW)
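As an illustration of the content features, the sketch below computes cosine similarity over raw term-frequency vectors. The slides do not state the term weighting or tokenization used, so raw counts, the simple alphanumeric tokenizer, and the helper names `tokenize` and `content_features` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def content_features(keyword, tweet_text, profile_text, webpage_text):
    """The three content features named on the slide."""
    return {
        "CosSimT": cosine_similarity(keyword, tweet_text),
        "CosSimP": cosine_similarity(keyword, profile_text),
        "CosSimW": cosine_similarity(keyword, webpage_text),
    }
```

With a single-word keyword, the score is simply the keyword's share of the text's vector length, so short, on-topic texts score highest.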
Methodology (3): Feature Extraction
• Network: estimate the importance of users through
  ◦ centrality scores
  ◦ PageRank
• Popularity: number of times the tweets containing the URL were
  ◦ retweeted
  ◦ liked
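A rough sketch of one network feature and the popularity features follows. It shows a plain power-iteration PageRank over the follow graph; the damping factor, iteration count, the direction of rank flow (follower to followee), and summing counts over all tweets that share a URL are assumptions. The other centrality scores mentioned on the slide are not shown.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed follow graph.

    edges: iterable of (follower, followee); a follow passes rank to the followee.
    """
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over all users.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank


def popularity(tweets_with_url):
    """Total retweets and likes over all tweets that contain a given URL."""
    return {
        "retweets": sum(t["retweets"] for t in tweets_with_url),
        "likes": sum(t["likes"] for t in tweets_with_url),
    }
```

The tweet dictionaries with `retweets` and `likes` keys are a hypothetical input format chosen for the example.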
Methodology (4): Labelling the URLs
• Labelled independently by 2 persons
  ◦ each with more than 4 years of professional programming experience in Java
  ◦ one a PhD student and the other a Research Engineer
• Both persons sat together to resolve disagreements
• URLs were assigned relevance scores from 0 to 3
Methodology (5): Recommendation
• Unsupervised: Borda Count
  ◦ assigns ranking points for each feature score of a URL and then combines the points
• Supervised: Learning to Rank
  ◦ learns a ranking function based on the weighted sum of the features of a URL
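Both ideas can be illustrated with a small sketch. For Borda count, each feature ranks all URLs and a URL earns more points the higher it ranks; the points are then summed. For the supervised side, only the shape of the scoring function (a weighted sum) is shown, since the slides do not name the learning algorithm. The point scheme (0 for the lowest score up to n-1 for the highest), tie handling by input order, and the names `borda_rank` and `linear_score` are assumptions.

```python
def borda_rank(feature_scores):
    """Rank URLs by Borda count.

    feature_scores: {url: {feature_name: score}}, where a higher score is better.
    Returns (urls sorted best-first, total points per URL).
    """
    urls = list(feature_scores)
    features = sorted({f for scores in feature_scores.values() for f in scores})
    points = {url: 0 for url in urls}
    for feature in features:
        # Sort ascending so the list position is the number of points earned.
        ordered = sorted(urls, key=lambda u: feature_scores[u].get(feature, 0.0))
        for position, url in enumerate(ordered):
            points[url] += position
    ranking = sorted(urls, key=lambda u: points[u], reverse=True)
    return ranking, points


def linear_score(weights, features):
    """Weighted sum of feature values; in learning to rank the weights are learned."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```

Borda count needs no labelled data, which is why it serves as the unsupervised option; the weighted sum needs labels to fit its weights.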
RQ1: Effectiveness of Our Approach
• NDCG (Normalized Discounted Cumulative Gain)
  ◦ Measures the capability to place URLs with higher relevance scores at the top ranks
  ◦ Scores range from 0 to 1; a score closer to 1 indicates better performance
[Bar chart: NDCG score by recommendation approach. Supervised: 0.832; Unsupervised: 0.719]
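For reference, NDCG can be computed as in the sketch below. Several DCG variants exist and the slides do not say which was used; this version assumes the common gain of 2^rel - 1 with a log2 position discount, applied to the 0-3 relevance labels.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

A ranking that lists the labels in descending order scores exactly 1; pushing highly relevant URLs down the list lowers the score.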
RQ2: Sensitivity of Supervised Approach to Training Data
[Bar chart: NDCG score for each k (no. of folds used); values paired in the order listed on the slide]
k (no. of folds used)   10     9      8      7      6      5      4      3      2
NDCG score              0.832  0.825  0.833  0.845  0.834  0.842  0.837  0.847  0.843
Threats to Validity
• Subjectivity in the labelling process
  ◦ asked 2 persons to label independently
• Only 1 domain
  ◦ evaluate more domains in future work
• Suitability of the evaluation metric
  ◦ used NDCG, which is a standard metric
Conclusion and Future Work
• Supervised and unsupervised approaches show promise in recommending URLs
• Future work:
  ◦ Automatically categorize the recommended URLs
  ◦ Build an automated system to recommend relevant URLs
Feedback/Advice
• What additional resources can we consider for mining URLs?
• How can developer interests be inferred automatically?
Thank you!
