SIGIR 2014
Gold Coast, Australia, 06-11 July 2014
Uncovering the Unarchived Web
Thaer Samar, Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Arjen de Vries
Link Extraction
Input
Dutch Archive (2009-2012)

7 TB (compressed)

76,828 ARC files

147,641,512 documents
Seedlist info:

5,000 websites

Selection dates

Assigned UNESCO codes
Filtering & Deduplication

Focus on links of which the source was archived in 2012

Deduplication: seeds are harvested at different frequencies, so
the same link can appear in several harvests

Links are deduplicated on srcUrl, targetUrl, anchorText, and a
hash of the source's content
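
A minimal sketch of this deduplication step, assuming each extracted link is available as a record with the four fields named above. The `Link` type, the `dedup_links` helper, and the choice of SHA-1 are illustrative assumptions; the poster does not state which hash function was used.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple


class Link(NamedTuple):
    """Hypothetical record for one extracted link."""
    src_url: str
    target_url: str
    anchor_text: str
    src_content: bytes  # raw content of the archived source page


def dedup_links(links: Iterable[Link]) -> Iterator[Link]:
    """Yield each link once per (srcUrl, targetUrl, anchorText, content hash)."""
    seen = set()
    for link in links:
        # SHA-1 is an assumption; any stable content hash would do here.
        content_hash = hashlib.sha1(link.src_content).hexdigest()
        key = (link.src_url, link.target_url, link.anchor_text, content_hash)
        if key not in seen:
            seen.add(key)
            yield link
```

Including the content hash in the key means a repeated harvest of an unchanged page collapses to one link, while a harvest in which the source page changed is kept.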
General Framework
Introduction

Web archives contain more than Web pages: they contain page
sources, outlinks, anchor text, and timestamps of archive dates

Outlinks and their anchor text can be used to establish evidence
of pages which existed at crawling time that were not archived
Further Analysis
The TLD distribution of the inter-domain uncovered Web resembles
that of a broad Web crawl (Common Crawl)
TLD distribution of unarchived URLs
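
One way such a TLD distribution could be computed is sketched below. The `tld_distribution` helper is hypothetical, and it takes the last label of the hostname as the TLD, which is a simplification (hosts given as IP addresses, for instance, would need separate handling).

```python
from collections import Counter
from typing import Dict, Iterable
from urllib.parse import urlsplit


def tld_distribution(urls: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of URLs per top-level domain."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        # Skip strings without a usable hostname.
        if host and "." in host:
            counts[host.rsplit(".", 1)[1]] += 1
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()} if total else {}
```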
Conclusions

Uncovering pages of the Web that were not archived and would
have been lost forever

Recover representation of unarchived pages by exploiting link
graph and anchor text

Aggregating anchor text from all sources linking to the target

Information about the sources linking to the target:
• number of (unique) sources
• source categories based on the assigned UNESCO codes
• indications whether a source is on the seedlist or not
Unique source URL & anchor word counts (inter-domain links)
Uncovered URL representations
Results
Representation Aggregation
For each link target:

Union all anchor text describing links pointing to the target

Count the number of unique sources and UNESCO codes pointing to the target

Count the number of unique anchor text words used to link to the target
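
The aggregation steps above can be sketched as follows. The `TargetRepresentation` class, the `aggregate` function, and the input tuple layout are assumptions made for illustration, as is splitting anchor text into words on whitespace after lowercasing.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class TargetRepresentation:
    """Hypothetical aggregated representation of one link target."""
    anchor_texts: set = field(default_factory=set)
    sources: set = field(default_factory=set)
    unesco_codes: set = field(default_factory=set)
    anchor_words: set = field(default_factory=set)


def aggregate(
    links: Iterable[Tuple[str, str, str, Optional[str]]]
) -> Dict[str, TargetRepresentation]:
    """Group (src_url, target_url, anchor_text, unesco_code) tuples by target."""
    reps = defaultdict(TargetRepresentation)
    for src_url, target_url, anchor_text, unesco_code in links:
        rep = reps[target_url]
        rep.anchor_texts.add(anchor_text)   # union of anchor text
        rep.sources.add(src_url)            # unique sources
        if unesco_code is not None:         # sources off the seedlist carry no code
            rep.unesco_codes.add(unesco_code)
        # Whitespace tokenization is an assumption, not the poster's method.
        rep.anchor_words.update(anchor_text.lower().split())
    return dict(reps)
```

The counts named above are then the sizes of the sets, for example `len(rep.sources)` and `len(rep.anchor_words)`.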
Uncovered URLs Analysis

Distinguish between internal & external links

Internal link: source and target have the same domain name
(intra-domain): 8,692,308

External link: source and target have different domain names
(inter-domain): 3,205,354
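
A rough sketch of the internal/external distinction, assuming it can be approximated by comparing hostnames after dropping a leading "www.". The helper names are hypothetical, and a faithful domain-name comparison would likely need a public suffix list, which this sketch leaves out.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_internal(src_url: str, target_url: str) -> bool:
    """True for an intra-domain link, False for an inter-domain link."""
    return host_of(src_url) == host_of(target_url)
```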
Categories of found URLs
1) Intentionally archived pages: pages from the seed list
2) Unintentionally archived pages: pages not from the seed list
(a side effect of crawling)
3) Aura: unarchived pages, known to exist because archived pages
link to them
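
The three categories could be assigned along these lines. The `categorize` function and its inputs are hypothetical: it assumes a set of archived URLs and a set of seedlist hostnames are available, and that seedlist membership is decided by hostname.

```python
from typing import Set
from urllib.parse import urlsplit


def categorize(url: str, archived_urls: Set[str], seed_hosts: Set[str]) -> str:
    """Classify a found URL as 'intentional', 'unintentional', or 'aura'."""
    if url not in archived_urls:
        # Known only through links from archived pages.
        return "aura"
    host = urlsplit(url).hostname or ""
    # Matching the seedlist by hostname is an assumption of this sketch.
    return "intentional" if host in seed_hosts else "unintentional"
```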
The number of uncovered pages indirectly collected while crawling
is almost equal to the number of intentionally crawled pages!
