This study proposes methods to automatically extract and recommend popular and relevant URLs from Twitter to support serendipitous learning for software developers. The researchers collected tweets from an expanded set of seed Twitter users, extracted the URLs they contained, and calculated 14 features for each URL. The URLs were labeled for relevance, and two approaches were used to recommend them: a supervised learning-to-rank model and an unsupervised Borda count. The supervised approach performed better, with an NDCG of 0.832 against 0.719 for the unsupervised approach. Future work includes automatically categorizing URLs and building a full recommendation system.
1. Harnessing Twitter to Support Serendipitous Learning of Developers
Abhishek Sharma (1), Yuan Tian (1), Agus Sulistya (1), David Lo (1) and Aiko Fallas Yamashita (2)
(1) School of Information Systems, Singapore Management University
(2) Oslo and Akershus University, Norway
24th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER 2017)
2. Developer Challenges?
Keeping up to date is a big challenge (Storey et al. TSE16)
4. Why Twitter for Learning
Keeping up to date is a big challenge (Storey et al. TSE16)
Twitter is used by software developers to share important information (Tian et al. MSR12)
Twitter enables serendipitous (pleasant and undirected) learning for developers (Singer et al. ICSE14)
Image: https://unsplash.com/photos/HAIPJ8PyeL8
8. Challenges
Finding useful articles is not easy
Developers need to identify many relevant Twitter users to follow and sieve through a large amount of tweets/URLs (Singer et al. ICSE14)
Too much information can make learning using Twitter an unpleasant experience
Image: https://unsplash.com/photos/yD5rv8_WzxA
9. This Study
Can we automatically extract popular and relevant URLs from Twitter for developers?
In this work, we:
propose 14 features to characterize a URL
evaluate a supervised and an unsupervised approach to recommend URLs harvested from Twitter
14. Methodology (1): Collecting Seed Data
Get a list of seed Twitter users
Source: http://www.noop.nl/2009/02/twitter-top-100-for-softwaredevelopers.htm
Get a larger set of people who follow (or are followed by) >= 5 seed users
Results in 85,171 Twitter users
Collect tweets generated by these users for a 1-month period (Nov 15)
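The expansion step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `expand_seed_users`, the `(follower, followee)` edge format, and the choice to count distinct seed users linked in either direction are all hypothetical. The slides also do not say whether the seed users themselves are part of the 85,171, so the sketch returns only the newly found users.

```python
def expand_seed_users(seeds, follows, min_links=5):
    """Return non-seed users linked to at least `min_links` distinct seed users.

    `follows` is an iterable of (follower, followee) pairs. A link in either
    direction (following a seed user or being followed by one) counts.
    Assumption: each seed user is counted at most once per candidate.
    """
    seeds = set(seeds)
    linked_seeds = {}  # candidate user -> set of seed users linked to them
    for follower, followee in follows:
        if followee in seeds and follower not in seeds:
            linked_seeds.setdefault(follower, set()).add(followee)
        if follower in seeds and followee not in seeds:
            linked_seeds.setdefault(followee, set()).add(follower)
    return {user for user, linked in linked_seeds.items()
            if len(linked) >= min_links}


# Illustrative data: "a" reaches 5 seed links, "b" only 4.
seeds = ["s1", "s2", "s3", "s4", "s5"]
follows = ([("a", s) for s in seeds[:4]] + [("s5", "a")]
           + [("b", s) for s in seeds[:4]])
print(expand_seed_users(seeds, follows))
```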
26. Methodology (3): Feature Extraction
Content: cosine similarity between keyword and
tweet text (CosSimT)
user profile text (CosSimP)
webpage text (CosSimW)
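A minimal sketch of how a content feature such as CosSimT could be computed. The slides do not state the term weighting or tokenization, so raw term counts and a simple lowercase alphanumeric tokenizer are assumed here. The helper names `tokenize` and `cosine_similarity` and the keyword "java" are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, treat as no similarity
    return dot / (norm_a * norm_b)


keyword = "java"  # hypothetical domain keyword
cos_sim_t = cosine_similarity(keyword, "Java streams")     # tweet text
cos_sim_p = cosine_similarity(keyword, "Python developer")  # profile text
print(cos_sim_t, cos_sim_p)
```

The same function would apply to the webpage text for CosSimW. Only the second argument changes.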
32. Methodology (3): Feature Extraction
Network: estimate importance of users through
centrality scores
PageRank
Popularity: number of times the tweets containing the URL were
retweeted
liked
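The network and popularity features can be illustrated with a small sketch. It assumes a follow graph given as `(follower, followee)` pairs, a plain power-iteration PageRank with a damping factor of 0.85, and tweets represented as dictionaries with hypothetical `urls`, `retweets`, and `likes` fields. None of these details come from the slides.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank flows from follower to followee."""
    nodes = sorted({user for edge in edges for user in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over all users.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def url_popularity(tweets):
    """Sum retweets and likes over all tweets that contain each URL."""
    totals = {}
    for tweet in tweets:
        for url in tweet["urls"]:
            counts = totals.setdefault(url, {"retweets": 0, "likes": 0})
            counts["retweets"] += tweet["retweets"]
            counts["likes"] += tweet["likes"]
    return totals


ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get))  # the user with the highest PageRank
```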
35. Methodology (4): Labelling the URLs
Labelled independently by 2 persons, each having more than 4 years of professional programming experience in Java
one a PhD student and the other a Research Engineer
Both persons sat together to resolve disagreements
URLs assigned relevance scores from 0-3
36. Methodology (5): Recommendation
Unsupervised Borda Count
assigns ranking points for each feature score for a URL and then combines the scores
Supervised Learning to Rank
learns a ranking function based on the weighted sum of features of a URL
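The unsupervised approach can be sketched as a basic Borda count. The exact point scheme used in the study is not given in the slides, so this version assumes that, for each feature, a URL earns one point for every URL it strictly outscores, and that higher feature values are better. The function name and the sample scores are hypothetical.

```python
def borda_count(feature_scores):
    """Rank URLs by summed Borda points across features.

    `feature_scores` maps each URL to a list of feature values, where a
    higher value is assumed to be better. For every feature, a URL earns one
    point per URL it strictly outscores, so tied URLs get equal points.
    Returns (url, points) pairs, best first.
    """
    urls = list(feature_scores)
    if not urls:
        return []
    n_features = len(feature_scores[urls[0]])
    points = {url: 0 for url in urls}
    for f in range(n_features):
        for url in urls:
            points[url] += sum(
                1 for other in urls
                if feature_scores[other][f] < feature_scores[url][f]
            )
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Illustrative feature values (e.g. a similarity, a retweet count, a like count).
scores = {
    "u1": [0.9, 10, 3],
    "u2": [0.5, 20, 1],
    "u3": [0.1, 5, 2],
}
print(borda_count(scores))
```

The supervised approach differs in that the weight given to each feature is learned from the labelled relevance scores rather than treating every feature equally.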
38. RQ1: Effectiveness of Our Approach
NDCG (Normalized Discounted Cumulative Gain)
Measures the capability to recommend higher ranked URLs at top ranks
Score closer to 1 specifies better performance with the range of scores being 0-1
[Bar chart: NDCG score by recommendation approach. Supervised: 0.832, Unsupervised: 0.719]
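For reference, a minimal NDCG sketch over a ranked list of relevance labels (0-3, as in the labelling step). The slides do not say which gain and discount variant was used. This sketch assumes the common exponential gain (2^rel - 1) with a log2 rank discount, and computes the score over the full list rather than at a cutoff.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1, 0]))  # a perfectly ordered list
print(ndcg([0, 1, 2, 3]))  # the same labels in the worst order
```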
46. Threats to Validity
Subjectivity in the labelling process
asked 2 persons to label independently
Only 1 domain
evaluate more domains in future work
Suitability of evaluation metric
used NDCG which is a standard metric
47. Conclusion and Future Work
Supervised and unsupervised approaches show promise in recommending URLs
Future work:
Automatically categorize the recommended URLs
Build an automated system to recommend relevant URLs
48. Feedback/Advice
What additional resources can we consider for mining URLs?
How can developer interests be inferred automatically?
Thank you!