This document summarizes a presentation about Google Scholar's citation graph. It discusses both the strengths and weaknesses of data available from Google Scholar. On the strengths side, it notes Google Scholar's large size and coverage of documents not found in other indexes. However, it also outlines weaknesses like the lack of support for data reuse and incomplete/erroneous metadata. The document describes personal experiences extracting and analyzing Google Scholar data to build new tools and data products, showing what could be possible with open access to the data.
Google Scholar's citation graph: comprehensive, global... and inaccessible
1. Google Scholar's citation graph:
comprehensive, global... and inaccessible
Alberto Martín-Martín, Emilio Delgado López-Cózar
Open Citations seminar, May 23rd, 2018, Uppsala, Sweden
2. TEAM
EMILIO DELGADO LÓPEZ-CÓZAR
PROFESSOR
@ UNIVERSIDAD DE GRANADA
ENRIQUE ORDUÑA-MALEA
ASSISTANT PROFESSOR
@ UNIVERSIDAD POLITÉCNICA DE VALENCIA
ALBERTO MARTÍN-MARTÍN
PHD STUDENT
@ UNIVERSIDAD DE GRANADA
3. CONTEXT
How we got to this point
STRUCTURE OF THE TALK
STRENGTHS
of data available in Google Scholar
WEAKNESSES
of data available in Google Scholar
PERSONAL EXPERIENCES
getting data from Google Scholar to generate data products
6. CITATION INDEXES
Selective coverage based on source selection
Commercial (subscription-based).
License to access data in bulk is separate from the license to the web application

Inclusive coverage based on parsing webpages
Non-commercial service offered by Google. Free to access.
Doesn't offer options to access data in bulk (agreements with publishers preclude it)
7. THE NEED FOR OPEN CITATIONS
"The citation graph is one of humankind's most important intellectual achievements"
DARIO TARABORELLI, FOUNDER OF I4OC
"In this open-access age, it is a scandal that reference lists from journal articles [...] are not readily and freely available for use by all scholars."
DAVID SHOTTON, FOUNDER OF OCC
"[I]n order to guarantee full transparency and reproducibility of scientometric analyses, these analyses need to be based on open data sources"
ISSI
9. OVERALL SIZE
Khabsa & Giles (2014): around 100 million documents (only in English)
Orduna-Malea et al. (2015): 130-180 million documents (no language restrictions)
Roughly 2-3 times the size of Web of Science and Scopus. There are also disciplinary differences (see later on).
Khabsa, M., & Giles, C. L. (2014). The number of scholarly documents on the public web. PLoS ONE, 9(5), e93949. https://doi.org/10.1371/journal.pone.0093949
Orduna-Malea, E., Ayllón, J. M., Martín-Martín, A., & Delgado López-Cózar, E. (2015). Methods for estimating the size of Google Scholar. Scientometrics, 104(3), 931-949. https://doi.org/10.1007/s11192-015-1614-6
10. DOCUMENT COVERAGE (1)
For a sample of 64,000 highly-cited documents according to Google Scholar, 49% of these documents were not covered by Web of Science
Martín-Martín, A., Orduna-Malea, E., Harzing, A. W., & Delgado López-Cózar, E. (2017). Can we use Google Scholar to identify highly-cited documents? Journal of Informetrics, 11(1), 152-163. https://doi.org/10.1016/j.joi.2016.11.008
Martín-Martín, A., Orduna-Malea, E., Ayllón, J. M., & Delgado López-Cózar, E. (2016). A two-sided academic landscape: snapshot of highly-cited documents in Google Scholar (1950-2013). Revista Española de Documentación Científica, 39(4), 1. https://doi.org/10.3989/redc.2016.4.1405
11. DOCUMENT COVERAGE (2)
Classic Papers: highly cited documents published in 2006 according to Google Scholar (released in June 2017)
252 unique subcategories, 8 broad categories covering all areas of knowledge
10 most cited documents in each subcategory. At least 20 citations per paper. Total number of articles: 2,515 (one category had only 5 documents)

Category                            Number of documents   Not found in WoS (%)   Not found in Scopus (%)
Humanities, Literature & Arts       245                   28.2                   17.1
Social Sciences                     510                   17.5                   8.6
Engineering & Computer Science      570                   11.6                   2.5
Business, Economics & Management    150                   6.0                    2.7
Health & Medical Sciences           680                   2.8                    0.3
Physics & Mathematics               230                   2.2                    1.7
Life Sciences & Earth Sciences      380                   0.5                    0.5
Chemical & Material Sciences        170                   0                      0

Martín-Martín, A., Orduna-Malea, E., & Delgado López-Cózar, E. (2018, April 23). Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison. http://doi.org/10.17605/OSF.IO/HCX27
12. CITATIONS FOUND (1)
[Figure: Average log-transformed citation counts of highly-cited documents according to Google Scholar published in 2006, based on data from Google Scholar, Web of Science, and Scopus, by broad subject categories]
13. CITATIONS FOUND (2)
2.30 M: total number of citations to 2,299 highly-cited documents from 2006 covered by GS, WoS, and Scopus
95% of all citations found by WoS (1.27 M) are also found by Google Scholar
91% of all citations found by Scopus (1.47 M) are also found by Google Scholar
We extracted the list of documents that cite these highly-cited documents from GS (custom script), WoS (web export), and Scopus (web export)
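The overlap percentages on this slide come down to a set comparison between the lists of citing documents exported from each source. A minimal sketch, assuming each citing document has already been reduced to a comparable key. The function name, the keys, and the tiny sample sets are made up for illustration; the slides do not describe the matching procedure that was actually used.

```python
def share_also_found(reference_citations, other_citations):
    """Percentage of reference_citations that also appear in other_citations."""
    reference = set(reference_citations)
    if not reference:
        raise ValueError("reference_citations is empty")
    overlap = reference & set(other_citations)
    return 100.0 * len(overlap) / len(reference)


# Hypothetical citing-document keys (for example normalized DOIs or titles).
wos_citing = {"doc-a", "doc-b", "doc-c", "doc-d"}
gs_citing = {"doc-a", "doc-b", "doc-c", "doc-e", "doc-f"}

wos_in_gs = share_also_found(wos_citing, gs_citing)
print(f"{wos_in_gs:.0f}% of the WoS citations are also found in GS")
```

The measure is asymmetric: the share of GS citations found in WoS uses the GS list as the denominator and will generally differ.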
14. CITATIONS FOUND (3)
What sources / document types does GS cover that WoS and Scopus do not?
[Figure panels: Humanities, Literature & Arts; Chemical & Material Sciences]
15. CITATIONS FOUND (4)
Analysis of articles or reviews with a DOI published in 2009 covered by Web of Science and Google Scholar (~1 million documents)

Citation index      N        Spearman correlation   p-value   Proportion cited (GS)   Proportion cited (WoS)   Ratio of GS to WoS citations (avg)
Sciences            863801   0.94                   0.00      0.97                    0.95                     1.68
Social Sciences     109232   0.90                   0.00      0.97                    0.94                     2.58
Arts & Humanities   13487    0.83                   0.00      0.84                    0.69                     2.52

[Figure panels: Sciences; Social Sciences; Arts & Humanities]
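For readers unfamiliar with the statistics in this table, the following sketch shows how a Spearman correlation and a "proportion cited" value can be computed from paired citation counts. It uses only the standard library; the counts are invented toy values, and nothing here is taken from the authors' own analysis code.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy citation counts for the same five documents in each source.
gs_cit = [120, 45, 10, 3, 0]
wos_cit = [60, 30, 2, 4, 0]

rho = spearman(gs_cit, wos_cit)
prop_cited_gs = sum(c > 0 for c in gs_cit) / len(gs_cit)
print(round(rho, 2), prop_cited_gs)
```

Because Spearman works on ranks, a high value means the two sources order documents similarly even when Google Scholar reports systematically higher counts.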
17. LIMITATIONS (1)
CONSEQUENCE OF TRYING TO USE A TOOL FOR A PURPOSE IT WASN'T DESIGNED FOR
No support for data reuse: no API available. All data extraction has to be done through web scraping
Agreements with publishers preclude Google from releasing the data
Tight security measures to avoid massive data collection (CAPTCHAs)
Persistent identifiers (DOIs, ORCIDs) are not available to the public (although they use DOIs internally)
Only 1,000 results can be displayed for any given query
Inability to fix individual errors. Very small team of people working on GS. Everything is automated.
Incorrect assignment of documents to researcher profiles (GSC)
18. LIMITATIONS (2)
CONSEQUENCE OF TRYING TO USE A TOOL FOR A PURPOSE IT WASN'T DESIGNED FOR
Incomplete or erroneous basic metadata, with little or no support for categorical variables at the document level:
  incomplete lists of authors!!
  truncated journal names!!
  no document types
  no author affiliations (only at author-level in GSC)
  no subject classifications
  Forget about funding acknowledgements
Undetected duplicates
Open to manipulation
Delgado López-Cózar, E., Robinson-García, N., & Torres-Salinas, D. (2014). The Google Scholar experiment: How to index false papers and manipulate bibliometric indicators. Journal of the Association for Information Science and Technology, 65(3), 446-454. https://doi.org/10.1002/asi.23056
Cited references of documents are not available
19. LIMITATIONS (3)
IS THERE A WAY TO OVERCOME THEM
that doesn't involve paying mind-numbingly high amounts of money to some multinational?
Complementing Google Scholar metadata with data from other freely accessible sources:
  Going to the source: metadata in publisher websites or repositories
  CrossRef Metadata API (for everything with a DOI)
    Complete basic metadata.
    Cited references available for over 51% of their records (so far Springer, Wiley, and some smaller publishers have agreed to make them public). https://i4oc.org/
    Author affiliations available from some publishers
Digging deeper into Google Scholar: they have more metadata, but it's expensive to extract it (time-wise).
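As an illustration of complementing a record through the CrossRef API, here is a minimal sketch that looks up one DOI at the public `https://api.crossref.org/works/{doi}` endpoint and keeps the fields Google Scholar tends to lack. The helper names and the choice of fields are assumptions made for this example, and the sample record further down is hand-written, not real CrossRef output.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_work_url(doi):
    """Build the REST URL for a single DOI (slashes in the DOI are kept)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def fetch_work(doi, timeout=30):
    """Download one record; the metadata sits under the "message" key."""
    with urlopen(crossref_work_url(doi), timeout=timeout) as response:
        return json.load(response)["message"]


def summarize_work(message):
    """Keep title, journal, full author list, affiliations, reference count."""
    authors = message.get("author", [])
    return {
        "title": (message.get("title") or [""])[0],
        "journal": (message.get("container-title") or [""])[0],
        "authors": [
            " ".join(part for part in (a.get("given"), a.get("family")) if part)
            for a in authors
        ],
        "affiliations": sorted(
            {aff["name"] for a in authors
             for aff in a.get("affiliation", []) if "name" in aff}
        ),
        "n_references": len(message.get("reference", [])),
    }


# Hand-written stand-in for a CrossRef "message" (placeholder values only).
sample_message = {
    "title": ["Example title"],
    "container-title": ["Example Journal"],
    "author": [
        {"given": "Ada", "family": "Example",
         "affiliation": [{"name": "Example University"}]},
        {"given": "Ben", "family": "Sample", "affiliation": []},
    ],
    "reference": [{"key": "ref1"}, {"key": "ref2"}],
}

print(crossref_work_url("10.1002/asi.23056"))
print(summarize_work(sample_message))
```

Calling `summarize_work(fetch_work(doi))` would run the same extraction against the live service. Affiliations and reference lists are present only when the publisher has deposited them, so both fields can come back empty.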
21. SOFTWARE THAT HANDLES GOOGLE SCHOLAR DATA
Publish or Perish, by Anne-Wil Harzing: https://harzing.com/resources/publish-or-perish
Scholarometer, from School of Informatics and Computing, Indiana University-Bloomington (currently doesn't work): http://scholarometer.indiana.edu
Scholar Plot (generates plots to visualize an author's academic career, using GS data): http://scholarplot.com/help.html
R Package: scholar: Analyse Citation Data from Google Scholar: https://cran.r-project.org/web/packages/scholar/index.html
R Package: scholarnetwork: Extract and Visualize Google Scholar Collaboration Networks: https://github.com/pablobarbera/scholarnetwork
R Package: cv (builds a list of publications by an author by extracting data from GS): https://github.com/bomeara/cv
Perl module: Web::Scraper::Citations (scrapes data from GS author profiles): https://github.com/JJ/net-citations-scraper
Tutorial: Put Google Scholar citations on your personal website with R, scholar, ggplot2 and cron: https://www.r-bloggers.com/put-google-scholar-citations-on-your-personal-website-with-r-scholar-ggplot2-and-cron/
Tutorial: Google scholar scraping with rvest package: https://datascienceplus.com/google-scholar-scraping-with-rvest/
Tutorial: Scraping Google Scholar to write your PhD literature chapter: https://mystudentvoices.com/scraping-google-scholar-to-write-your-phd-literature-chapter-2ea35f8f4fa1
22. MY WORKFLOW (1)
STRUCTURE OF A GOOGLE SCHOLAR RECORD
No magic recipe. I have to deal with CAPTCHAs and scrape the raw HTML just like everyone else.
Query embedded in URL: https://scholar.google.com/scholar_lookup?doi=10.1002/asi.23056
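Since the query is simply embedded in the URL, a list of DOIs can be turned into a list of lookup URLs before any scraping starts. A minimal sketch; the function names and the one-DOI-per-line input format are assumptions for this example, while the endpoint and the `doi` parameter are the ones shown on the slide.

```python
from urllib.parse import urlencode

LOOKUP_ENDPOINT = "https://scholar.google.com/scholar_lookup"


def lookup_url(doi):
    """Return the scholar_lookup URL for one DOI, keeping its slashes readable."""
    return LOOKUP_ENDPOINT + "?" + urlencode({"doi": doi.strip()}, safe="/")


def read_queries(lines):
    """Turn raw lines (one DOI per line, blanks allowed) into lookup URLs."""
    return [lookup_url(line) for line in lines if line.strip()]


urls = read_queries(["10.1002/asi.23056\n", "\n", " 10.1007/s11192-015-1614-6 "])
for url in urls:
    print(url)
```

In practice `lines` would come from iterating over an open text file of DOIs; the sketch only builds the URLs and does not request them.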
23. MY WORKFLOW (2)
DATA EXTRACTION PROCESS
Two-step process (+ cleaning):
1. Getting raw HTML:
  Python script that reads a list of queries
  Selenium WebDriver (a headless browser doesn't work because it is necessary to solve CAPTCHAs)
  Pagination is taken into account (Google Scholar displays a maximum of 10 records per page, and a maximum of 1,000 records per query)
  When a CAPTCHA appears, the script pauses. A human has to solve the CAPTCHA, then the script can resume.
  Each page is saved as an HTML file.
  More computers: faster.
2. Parsing HTML:
  Once all raw files have been downloaded, another Python script, using the Scrapy library, reads these files.
  Using XPath, relevant data is identified within the HTML.
  Data is saved to a CSV file.
3. Cleaning: getting more complete metadata from the source website and/or CrossRef
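The pagination limits and the parsing step can be sketched as follows. This is not the author's script: it swaps Scrapy for the standard library's `xml.etree.ElementTree`, whose limited XPath support is enough for a small, well-formed sample but would not cope with real saved pages. The class names `gs_ri`, `gs_rt`, and `gs_a`, the sample markup, and the output columns are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

RESULTS_PER_PAGE = 10          # records shown per page (from the slide)
MAX_RESULTS_PER_QUERY = 1000   # records reachable per query (from the slide)


def page_offsets(expected_results):
    """Offsets of the result pages worth requesting for one query."""
    reachable = min(expected_results, MAX_RESULTS_PER_QUERY)
    return list(range(0, reachable, RESULTS_PER_PAGE))


def parse_records(page_markup):
    """Extract title, link, and byline from each record on a saved page."""
    root = ET.fromstring(page_markup.strip())
    rows = []
    for record in root.findall(".//div[@class='gs_ri']"):
        link = record.find("h3[@class='gs_rt']/a")
        byline = record.find("div[@class='gs_a']")
        rows.append({
            "title": "".join(link.itertext()).strip() if link is not None else "",
            "url": link.get("href", "") if link is not None else "",
            "byline": "".join(byline.itertext()).strip() if byline is not None else "",
        })
    return rows


def to_csv(rows):
    """Serialize the parsed rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title", "url", "byline"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical, well-formed stand-in for one saved results page.
SAMPLE_PAGE = """
<html><body>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.org/paper-1">First <b>example</b> paper</a></h3>
    <div class="gs_a">A Author, B Author - Example Journal, 2014</div>
  </div>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.org/paper-2">Second example paper</a></h3>
    <div class="gs_a">C Author - Another Journal, 2015</div>
  </div>
</body></html>
"""

rows = parse_records(SAMPLE_PAGE)
print(to_csv(rows))
```

Keeping download and parsing as separate passes, as the slide describes, means the XPath expressions can be corrected and re-run over the saved files without facing the CAPTCHAs again.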
24. DATA PRODUCTS (1)
BUILDING NEW TOOLS ON TOP OF GOOGLE SCHOLAR DATA
JOURNAL SCHOLAR METRICS
Presents bibliometric indicators for 9,196 journals in the Arts, Humanities, and Social Sciences, by discipline.
These areas have traditionally presented more difficulties in terms of bibliometric assessment, mainly because of the lack of international, geographically and linguistically unbiased tools
http://www.journal-scholar-metrics.infoec3.es
25. DATA PRODUCTS (1)
BUILDING NEW TOOLS ON TOP OF GOOGLE SCHOLAR DATA
SCHOLAR MIRRORS
Bibliometric and altmetric indicators for authors, documents, journals, and publishers in the field of Bibliometrics, Scientometrics, Informetrics, Webometrics, and Altmetrics in Google Scholar Citations, ResearcherID, ResearchGate, Mendeley, and Twitter.
Data was extracted from Google Scholar, ResearcherID, ResearchGate, Mendeley, and Twitter.
http://www.scholar-mirrors.infoec3.es
26. DATA PRODUCTS (1)
BUILDING NEW TOOLS ON TOP OF GOOGLE SCHOLAR DATA
WORK IN PROGRESS
Google Scholar-powered scientific information system that displays data about all researchers working in Spain.
Dataset: 44,500 profiles in Google Scholar Citations (profile service); 2 million documents approx.; 30 million citations
Necessary: generating a document-level classification for this collection of GS data. I'm using Ludo Waltman's and Nees Jan van Eck's smart local moving algorithm.
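The smart local moving algorithm is a modularity-based community detection method, so a document-level classification built with it amounts to finding a partition of the citation network that scores well on modularity. The sketch below is not the algorithm itself. It only computes standard modularity (resolution 1) for a given partition of a small undirected, unweighted toy graph, and all names and data are illustrative.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)      # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy network: two triangles of documents joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "all" for node in range(6)}

print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of improvement a local moving heuristic searches for when it reassigns nodes between clusters.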
27. ONE LAST THOUGHT
If you make data available, people will build on top of it
We want to provide a glimpse of what could be done with Google Scholar data if it stopped being a black box