Presentation given at ASCoR Spring Colloquium Big Data at the University of Amsterdam on February 18, 2014
Connecting political data to media data
1. Connecting political data to media data
Laura Hollink
VU University Amsterdam
Web & Media group
ASCoR Spring Colloquium Big Data at the University of Amsterdam
February 18, 2014
5. Questions we want to answer
Which events have attracted a lot of media attention?
What are the differences between different media, e.g. between different newspapers, or between newspapers and radio bulletins?
Has the coverage changed over time?
How are the events visualized (photos, layout of the newspaper, etc.)?
7. Transcriptions of all 9,294 meetings of the Dutch parliament between 1945 and 1995, consisting of 1,208,903 speeches.
8. Archives of hundreds of newspapers, with a very large number of issues and tens of millions of articles, published between 1618 and 1995. (We only use 1945-1995.)
9. Roughly 1.8 million news bulletins from between 1937 and 1984. (We only use 1945-1995.)
11. Step 1: Translate the Dutch parliamentary debates
to the standard structured web format RDF
[Figure: example RDF graph of one debate, converted from XML by the War in Parliament project.
Debate: nl.proc.sgd.d.194519460000002 ("Handelingen Verenigde Vergadering..."), dc:date 1945-11-20, dc:language Dutch, dc:publisher http://statengeneraaldigitaal.nl/.
Parts: nl.proc.sgd.d.194519460000002.1 and nl.proc.sgd.d.194519460000002.2, with the elements nl.proc.sgd.d.194519460000002.1.1, nl.proc.sgd.d.194519460000002.1.2 and nl.proc.sgd.d.194519460000002.1.3.
Classes: Debate, PartOfDebate, DebateContext, Speech, Politician, Party.
Properties: rdf:type, rdfs:label, dc:date, dc:language, dc:publisher, dc:id, dc:source, hasPart, hasSubsequentPartOfDebate, hasSubsequentSpeech, hasText, hasSpokenText, hasSpeaker, sem:hasActor, hasRole, hasParty, hasFullName, hasAcronym, foaf:firstName, foaf:lastName, coveredIn.
Text literals: "De voorzitter opent de vergadering" ("The chairman opens the meeting") and "Mijnheer de Voorzitter, de Commissie van " ("Mr. Chairman, the Committee of ").
Speaker: Joannes Antonius James Barge (http://resolver.politicalmashup.nl/nl.m.00064), role member_of_parliament, party Party_kvp (Katholieke Volkspartij, KVP).
Sources and links: http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002, http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf, http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr, nl.proc.sgd.d.19720000002.]
12. Modeling the debates as events
An event has a date, a location, actors, and possibly sub-events.
We build on the Simple Event Model (SEM).
The model links to the original sources and reuses existing vocabularies.
[Figure: detail of the RDF graph for the debate nl.proc.sgd.d.194519460000002: rdf:type Debate, dc:title "Handelingen Verenigde Vergadering...", dc:date 1945-11-20, dc:language Dutch, dc:publisher http://statengeneraaldigitaal.nl/, dc:id, and dc:source links to http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002 and http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf. The identifier nl.proc.sgd.d.19720000002 also appears in the figure.]
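The event structure described on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the triples are transcribed from the example graph, but the direction of each edge, the attachment of the speech to the first part of the debate, and the helper names `index_by_subject` and `sub_events` are assumptions and not the project's actual code.

```python
from collections import defaultdict

# Identifiers as they appear in the example graph on the slides.
DEBATE = "nl.proc.sgd.d.194519460000002"
PART = DEBATE + ".1"
SPEECH = PART + ".2"

# (subject, property, object) triples. Edge directions are assumed.
TRIPLES = [
    (DEBATE, "rdf:type", "Debate"),
    (DEBATE, "dc:date", "1945-11-20"),
    (DEBATE, "dc:language", "Dutch"),
    (DEBATE, "hasPart", PART),
    (PART, "rdf:type", "PartOfDebate"),
    (PART, "hasPart", SPEECH),
    (SPEECH, "rdf:type", "Speech"),
    (SPEECH, "sem:hasActor", "http://resolver.politicalmashup.nl/nl.m.00064"),
]


def index_by_subject(triples):
    """Group triples as {subject: {property: [objects]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, prop, obj in triples:
        index[subject][prop].append(obj)
    return index


def sub_events(index, event):
    """Yield every sub-event reachable from `event` via hasPart, depth first."""
    for child in index[event]["hasPart"]:
        yield child
        yield from sub_events(index, child)


if __name__ == "__main__":
    idx = index_by_subject(TRIPLES)
    # Walk from the debate down to its parts and speeches.
    print(list(sub_events(idx, DEBATE)))
```

Treating a debate, its parts and its speeches as nested events means the same traversal works at every level of the hierarchy.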
14. [Figure: detail of the RDF graph for the speech nl.proc.sgd.d.194519460000002.1.2: rdf:type Speech, hasSpokenText "Mijnheer de Voorzitter, de Commissie van " ("Mr. Chairman, the Committee of "), coveredIn http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr. The speaker (hasSpeaker, sem:hasActor) is the Politician Joannes Antonius James (foaf:firstName) Barge (foaf:lastName, rdfs:label), with hasRole member_of_parliament and hasParty Party_kvp (rdf:type Party, hasFullName Katholieke Volkspartij, hasAcronym KVP).]
The model captures the different roles and parties that a speaker can have in his/her career.
15. Step 2: Linking speeches in the debate to the
newspaper articles that cover them
We created a linking method to deal with our two challenges:
1. How to link documents that are so different in nature?
2. Can we use the structure of the debates (people, chronological order of speeches, introductions to each new topic, etc.)?
[Figure: linking pipeline with the components Debates; Name of speaker; Date of debate; Search newspaper archive; Candidate articles; Detect topics in speeches; Topics; Detect Named Entities in speeches; Named Entities; Create queries; Queries; Rank candidate articles; Links between speeches and articles.]
16. Step 2: Linking speeches in the debate to the
newspaper articles that cover them
Intuition 1: The name of the speaker should appear in the article and the article should be published within a week of the debate.
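Intuition 1 can be read as a simple filter over the newspaper archive. The sketch below is hypothetical: the `Article` record, the function names, and the choice to count the week from the debate date forward (rather than in both directions) are assumptions, and matching on the last name only is a simplification.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Article:
    """Hypothetical record for one newspaper article."""
    url: str
    published: date
    text: str


def is_candidate(article, speaker_last_name, debate_date, window_days=7):
    """True if the speaker's name occurs in the article text and the article
    was published on the debate date or at most `window_days` days later."""
    delta = article.published - debate_date
    in_window = timedelta(0) <= delta <= timedelta(days=window_days)
    name_found = speaker_last_name.lower() in article.text.lower()
    return in_window and name_found


def select_candidates(articles, speaker_last_name, debate_date):
    """Keep only the articles that satisfy Intuition 1."""
    return [a for a in articles
            if is_candidate(a, speaker_last_name, debate_date)]
```

The filter only produces candidate articles; deciding which candidates actually cover the speech is left to the ranking step.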
17. Step 2: Linking speeches in the debate to the
newspaper articles that cover them
Intuition 2: The more the article and the speech overlap in terms of topics and named entities, the more they are related.
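Intuition 2 suggests ranking candidate articles by how many of the speech's topics and named entities they contain. The slides do not say how overlap is measured, so the sketch below swaps in a plain term-overlap fraction as a stand-in; the function names and the `(url, text)` pair format are assumptions.

```python
import re


def tokens(text):
    """Lower-cased set of word tokens in `text`."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(article_text, query_terms):
    """Fraction of the query's word tokens (topics plus named entities)
    that also occur in the article. Returns 0.0 for an empty query."""
    terms = set()
    for term in query_terms:
        terms |= tokens(term)
    if not terms:
        return 0.0
    return len(terms & tokens(article_text)) / len(terms)


def rank(articles, query_terms):
    """Sort (url, text) pairs by descending overlap with the query terms."""
    return sorted(articles,
                  key=lambda pair: overlap_score(pair[1], query_terms),
                  reverse=True)
```

A weighted scheme (for example one that down-weights very common words) would likely behave better on real newspaper text, but the ordering principle is the same.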
18. Evaluation: what do we use to rank the candidate
articles?
Experiment on 150 <newspaper article, speech in debate> pairs, rated by 2 raters (inter-rater agreement κ = 0.5).
Compare text of candidate articles to:
Setting 1: Named Entities in speech
Setting 2: Named Entities + Topics in speech
Setting 3: Named Entities + Topics in speech and larger part-of-debate

Score                                Setting 1   Setting 2   Setting 3
I don't know                         0.14        0.15        0.08
0 - unrelated                        0.38        0.23        0.12
1 - related                          0.29        0.36        0.36
2 - explicit mention of the debate   0.19        0.26        0.44
1 + 2                                0.48        0.62        0.80
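The agreement figure reported for the two raters is presumably Cohen's kappa; the slides do not name the statistic, so that reading is an assumption. Under that assumption, a minimal sketch of the computation for two raters labelling the same items looks like this (the function name is hypothetical):

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, `cohen_kappa([0, 0, 1, 1], [0, 1, 1, 1])` evaluates to 0.5 (observed agreement 0.75, chance agreement 0.5). These labels are toy values, not the ratings from the experiment.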
19. Results
An open data set of Dutch parliamentary debates, with almost 3 million links between 450,000 speeches and the URLs of 1.5 million newspaper articles and radio bulletins at the National Library.
It is accessible through a Web demonstrator and through a SPARQL endpoint.
27. SPARQL endpoint
A service to query a knowledge base using the SPARQL query language.
Example: all speeches with more than 60 associated news items.
# Count the news items linked to each speech, then keep speeches with more than 60.
SELECT ?speech ?no_news_items {
  {
    SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE {
      ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news .
    }
    GROUP BY ?speech
  }
  FILTER (?no_news_items > 60)
}
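A query like the one above is typically sent to an endpoint over HTTP, following the SPARQL 1.1 Protocol (the query goes in a `query` parameter). The sketch below only builds the request and does not send it. The endpoint address `http://example.org/sparql` is a placeholder, since the slides do not give the real one, and `build_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request

# The query from the slide, with one consistent variable name.
QUERY = """
SELECT ?speech ?no_news_items {
  {
    SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE {
      ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news .
    }
    GROUP BY ?speech
  }
  FILTER (?no_news_items > 60)
}
"""

# Placeholder endpoint: an assumption, not the project's actual address.
ENDPOINT = "http://example.org/sparql"


def build_request(endpoint, query):
    """Build a GET request for a SPARQL query, asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    request = build_request(ENDPOINT, QUERY)
    # Passing `request` to urllib.request.urlopen would send it.
    print(request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the encoded URL before contacting a real endpoint.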
32. Reflection: to what extent can we answer these questions?
Which events have attracted a lot of media attention?
What are the differences between different media, e.g. between different newspapers, or between newspapers and radio bulletins?
Has the coverage changed over time?
How are the events visualized (photos, layout of the newspaper, etc.)?
33. Future work
More types of links: from just coveredIn to quotedIn, coveredIn, backgroundOf, talksAbout.
More types of media.
More types of (political) events.
34. Project Talk of Europe / Traveling Clarin Campus
2014-2015
Funded by CLARIN-ERIC
From left to right: Max Kemman, Marnix van Berchum, Laura Hollink, Astrid van Aggelen, Steven Krauwer,
Henri Beunders. (Unfortunately, Martijn Kleppe and Johan Oomen were not present to join the group pic.)
35. Plans of ToE/TTC
1. Publish proceedings of the EU parliamentary debates in RDF, hosted by DANS.
2. Organize 3 workshops/hackathons/Traveling Clarin Campuses in which we invite international partners to work with the data.
3. In collaboration with international partners:
   enrich with annotations, e.g. topics, structured data about people, parties, etc.
   link to national datasets, e.g. media or national parliaments.