Harnessing Twitter to Support
Serendipitous Learning of Developers
Abhishek Sharma¹, Yuan Tian¹, Agus Sulistya¹, David Lo¹
and Aiko Fallas Yamashita²
¹School of Information Systems,
Singapore Management University
²Oslo and Akershus University, Norway
24th IEEE International Conference on Software Analysis,
Evolution, and Reengineering (SANER 2017)

Developer Challenges?
• Keeping up to date is a big challenge
(Storey et al. TSE’16)

Why Twitter for Learning
• Twitter is used by software
developers to share important
information (Tian et al. MSR’12)
• Twitter enables serendipitous
(pleasant and undirected) learning
for developers (Singer et al.
ICSE’14)
https://unsplash.com/photos/HAIPJ8PyeL8

Challenges
• Finding useful articles is not easy
• Developers need to
– identify many relevant Twitter users to follow
– sieve through a large amount of tweets/URLs
(Singer et al. ICSE’14)
• Too much information can make learning using Twitter an
unpleasant experience
https://unsplash.com/photos/yD5rv8_WzxA

This Study
• Can we automatically extract popular and relevant URLs
from Twitter for developers?
• In this work, we:
– propose 14 features to characterize a URL
– evaluate a supervised and an unsupervised approach to
recommend URLs harvested from Twitter

Methodology (1): Collecting Seed Data

• Get a list of seed twitter users
http://www.noop.nl/2009/02/twitter-top-100-for-softwaredevelopers.htm

• Get a larger set of people who
– Follow (or are followed by) >= 5 seed users
– Results in 85,171 Twitter users
• Collect tweets generated by these users over a 1-month
period (Nov ’15)
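
The expansion from seed users to the larger user set can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `follows` mapping, the function name `expand_seed_users`, the reading of "follow (or are followed by) >= 5 seed users" as counting distinct seed users connected in either direction, and the choice to leave the seed users themselves out of the result are all hypothetical.

```python
from collections import defaultdict


def expand_seed_users(follows, seeds, min_links=5):
    """Return non-seed users connected to at least `min_links` seed users.

    follows: dict mapping a user to the set of users they follow
             (hypothetical input format).
    seeds:   iterable of seed user names.
    """
    seeds = set(seeds)
    linked_seeds = defaultdict(set)  # candidate -> seed users it is linked to

    for user, followees in follows.items():
        for followee in followees:
            if followee in seeds and user not in seeds:
                linked_seeds[user].add(followee)  # candidate follows a seed
            if user in seeds and followee not in seeds:
                linked_seeds[followee].add(user)  # candidate is followed by a seed

    return {u for u, linked in linked_seeds.items() if len(linked) >= min_links}
```

Counting distinct seed users in a set means that a candidate who both follows and is followed by the same seed user is counted once for that seed.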

Methodology (2): URL Extraction
• Find tweets which contain the keyword “java” (2,104 tweets)
• Find tweets which contain a URL (1,606 tweets)
• Extract URLs, e.g.
http://ow.ly/UIxwS
http://bit.ly/1OFsZSj
http://goo.gl/IGxGlo
https://t.co/ryPI3
• Expand short URLs (770 expanded URLs)
• Resolve duplicate/broken URLs (577)
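
The extraction steps above can be sketched roughly as follows. The function names, the simplified URL pattern, and the injected `resolve` callable are assumptions made for illustration; the slides do not say how URLs were matched or expanded. Note that a plain substring match on "java" also matches words such as "javascript".

```python
import re

# Simplified pattern (assumption): anything from http(s):// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweets, keyword="java"):
    """Collect URLs from tweets whose text mentions `keyword` (case-insensitive)."""
    urls = []
    for text in tweets:
        if keyword.lower() in text.lower():
            urls.extend(URL_PATTERN.findall(text))
    return urls


def expand_and_deduplicate(urls, resolve):
    """Expand each URL with `resolve`, dropping broken links and duplicates.

    resolve: hypothetical callable that returns the final (expanded) URL,
             or None when the link is broken.
    """
    seen = set()
    expanded = []
    for url in urls:
        final = resolve(url)
        if final is None or final in seen:
            continue  # broken link, or already collected
        seen.add(final)
        expanded.append(final)
    return expanded
```

Passing `resolve` in as a parameter keeps the sketch testable offline. In practice it might follow HTTP redirects, for example by returning `urllib.request.urlopen(url).geturl()` and mapping network errors to `None`.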

Methodology (3): Feature Extraction
• 14 features extracted
– Content
– Popularity
– Network

• Content
– cosine similarity between keyword and
• tweet text (CosSimT)
• user profile text (CosSimP)
• webpage text (CosSimW)
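
As an illustration of the content features, the sketch below computes cosine similarity between a keyword and a piece of text using raw term-frequency vectors. The tokenizer and the term-frequency weighting are assumptions; the slides do not specify how the text was vectorized (a TF-IDF weighting would be another plausible choice).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Under this sketch, CosSimT, CosSimP, and CosSimW would be the same function applied to the tweet text, the user profile text, and the webpage text respectively.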

• Network
– estimate importance of users through
• centrality scores
• PageRank
• Popularity
– number of times the tweets containing the URL were
• retweeted
• liked
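
To illustrate the network features, here is a small power-iteration sketch of PageRank over a follow graph. The edge direction (from follower to followee, so that being followed raises a user's score), the damping factor of 0.85, and the fixed iteration count are assumptions, not details taken from the slides.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Approximate PageRank scores for users in a follow graph.

    follows: dict mapping a user to the set of users they follow
             (hypothetical input format; edges point follower -> followee).
    """
    nodes = set(follows)
    for followees in follows.values():
        nodes.update(followees)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over all users.
        dangling = sum(rank[u] for u in nodes if not follows.get(u))
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for user, followees in follows.items():
            if not followees:
                continue
            share = damping * rank[user] / len(followees)
            for followee in followees:
                new_rank[followee] += share
        rank = new_rank

    return rank
```

A URL-level feature could then be derived from the scores of the users who tweeted that URL, although how the scores are aggregated per URL is not stated in the slides.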

Methodology (4): Labelling the URLs
• Labelled independently by
– 2 persons, each with more than 4 years of professional
programming experience in Java
– one a PhD student and the other a Research Engineer
• Both persons sat together to resolve disagreements
• URLs were assigned relevance scores from 0 to 3

Methodology (5): Recommendation
• Unsupervised – Borda Count
– assigns ranking points for each feature score of a
URL and then combines the points
• Supervised – Learning to Rank
– learns a ranking function based on the weighted sum
of the features of a URL
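
The unsupervised Borda Count idea can be sketched as below. This is one common formulation, given as an assumption: for each feature, the URL with the lowest score gets 0 points and the one with the highest gets n-1, and the points are summed across features. The function name, input format, and tie handling (ties fall back to input order) are hypothetical.

```python
def borda_rank(feature_scores):
    """Rank URLs by summed Borda points across features.

    feature_scores: dict mapping a URL to a list of feature values,
                    where a higher value is assumed to be better.
    Returns the URLs ordered from most to least recommended.
    """
    urls = list(feature_scores)
    if not urls:
        return []

    n_features = len(feature_scores[urls[0]])
    points = {url: 0 for url in urls}

    for f in range(n_features):
        # Ascending order: position 0 is the worst URL for this feature.
        ordered = sorted(urls, key=lambda u: feature_scores[u][f])
        for position, url in enumerate(ordered):
            points[url] += position

    return sorted(urls, key=lambda u: points[u], reverse=True)
```

Because only per-feature ranks are combined, features on very different scales (for example a similarity in [0, 1] and a retweet count) can be merged without normalization.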

RQ1: Effectiveness of Our Approach
• NDCG (Normalized Discounted Cumulative Gain)
• Measures the capability to recommend more relevant URLs at
top ranks
• Scores range from 0 to 1; a score closer to 1 indicates better
performance
[Bar chart: NDCG score by recommendation approach]
Supervised: 0.832
Unsupervised: 0.719
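
For reference, NDCG can be computed as in the sketch below. It assumes the variant in which the relevance score at rank i is discounted by log2(i + 1); another common variant uses 2^rel - 1 as the gain, and the slides do not say which one was used.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    # enumerate starts at 0, so rank i = pos + 1 and the discount is log2(i + 1).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

With the 0 to 3 relevance labels from the labelling step, a recommendation list whose labels are already in descending order scores 1, and placing low-relevance URLs near the top lowers the score.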

RQ2: Sensitivity of Supervised
Approach to Training Data
[Bar chart: NDCG score by k (no. of folds used)]
k     NDCG score
10    0.832
9     0.825
8     0.833
7     0.845
6     0.834
5     0.842
4     0.837
3     0.847
2     0.843

Threats to Validity
• Subjectivity in the labelling process
– asked 2 persons to label independently
• Only 1 domain
– evaluate more domains in future work
• Suitability of evaluation metric
– used NDCG, which is a standard metric

Conclusion and Future Work
• Supervised and unsupervised approaches
show promise in recommending URLs
• Future work:
– Automatically categorize the recommended
URLs
– Build an automated system to recommend
relevant URLs

Feedback/Advice
• What additional resources can we
consider for mining URLs?
• How to infer developer interests
automatically?
Thank you!
