The document summarizes the PoliticalMashup project, which aims to connect promises and actions of politicians with societal reactions by integrating large datasets. It discusses using text analytics and XML techniques on datasets like Dutch parliamentary proceedings and election manifestos to enable automated analysis. Example applications include search, entity linking, and detecting promises by ministers. It also outlines several areas for natural language processing research using the datasets, such as topic detection and modeling populist language.
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras... (Vladimir Alexiev, PhD, PMP)
The European Holocaust Research Infrastructure (EHRI) is a large-scale EU project that involves 23 institutions and archives working on Holocaust studies, from Europe, Israel and the US. In its first phase (2011-2015) it aggregated archival descriptions and materials on a large scale and built a Virtual Research Environment (portal) for Holocaust researchers based on a graph database.
In its second phase (2015-2019), EHRI2 seeks to enhance the gathered materials using semantic approaches: enrichment, coreferencing, interlinking. Semantic integration involves four of the 14 EHRI2 work packages and helps integrate databases, free text, and metadata to interconnect historical entities (people, organizations, places, historic events) and create networks. We will present some of the EHRI2 technical work, including critical issues we have encountered.
WP10 (EAD) converts archival descriptions from various formats to standard EAD XML; transports EADs using OAI-PMH or ResourceSync; ingests EADs into the EHRI database; and enables use cases such as synchronization and coreferencing of textual Access Points to proper thesaurus references.
WP11 (Authorities and Standards) consolidates and enlarges the EHRI authorities to render the indexing and retrieval of information more effective. It addresses Access Points in ingested EADs (normalization of Unicode, spelling, punctuation; deduplication; clustering; coreferencing to authority control), Subjects (deployment of a Thesaurus Management System in support of the EHRI Thesaurus Editorial Board), Places (coreferencing to Geonames); Camps and Ghettos (integrating data with Wikidata); Persons, Corporate Bodies (using USHMM HSV and VIAF); semantic (conceptual) search including hierarchical query expansion; interconnectivity of archival descriptions; permanent URLs; metadata quality; EAD RelaxNG and Schematron schemas and validation, etc.
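The hierarchical query expansion mentioned for WP11 can be illustrated with a minimal sketch: a search on a broad concept also retrieves material indexed under its narrower terms. The tiny SKOS-like thesaurus below is hypothetical, not EHRI's actual vocabulary.

```python
# Minimal sketch of hierarchical (downward) query expansion over a
# SKOS-like thesaurus. The concepts below are invented for illustration.

THESAURUS = {
    # concept -> list of narrower concepts
    "persecution": ["deportation", "forced labour"],
    "deportation": ["transports"],
    "forced labour": [],
    "transports": [],
}

def expand(concept):
    """Return the concept plus all of its descendants (query expansion)."""
    result = [concept]
    for narrower in THESAURUS.get(concept, []):
        result.extend(expand(narrower))
    return result

print(expand("persecution"))
# A query for "persecution" now also matches documents indexed only
# with the narrower terms.
```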
WP13 (Data Infrastructures) builds up domain knowledge bases from institutional databases by using deduplication, semantic data integration, semantic text analysis. It provides the foundation for research use cases on Jewish Social Networks and their impact on the chance of survival.
WP14 (Digital Historiography Research) works on semantic text analysis (semantic enrichment), text similarity (e.g. clustering based on Neural Networks, LDA, etc), geo-mapping. It develops Digital Historiography researcher tools, including Prosopographical approaches.
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor... (Peter Löwe)
This document discusses the Technische Informationsbibliothek (TIB), Germany's largest science and technology library. It was founded in 1959 in response to the Sputnik crisis to provide scientific and technical information. The TIB has expanded its role and services beyond text to include audiovisual content, research data, and software in line with open science principles. It is exploring how to acquire and provide access to open source geospatial (OSGeo) audiovisual content through its TIB|AV portal in a sustainable way through practices like digital object identifiers and long-term preservation. Demonstrators show how the content could be used for thematic video blogging and community mind mapping.
Reusing historical newspapers of KB in e-humanities - Case studies and exampl... (Olaf Janssen)
This slidedeck gives an overview of Dutch e-humanities projects that build upon the datasets of the Koninklijke Bibliotheek, the national library of the Netherlands.
It focuses on 8 projects that reuse the digitized historical newspapers (1618-1995) of the KB.
It was presented on 7-1-2014 at the Huygens Institute for the History of the Netherlands (Huygens ING for short), an institute of the Royal Netherlands Academy of Arts and Sciences (KNAW); with around 100 scholars it is the largest humanities institute in the Netherlands.
Keywords: biland,delpher,e-humanities,elite network shifts,hirods,historical newspapers,isher,koninklijke bibliotheek,national library of the netherlands,open data,polimedia,political mashup,reuse,sealincmedia,translantis,washp
The document discusses the PoliticalMashup project in the Netherlands, which aims to connect political data sources and make implicit network structures and information explicit. It describes how parliamentary proceedings data can be used to construct cooperation and opposition networks between politicians and parties. Various network graphs and longitudinal analyses are possible using the rich data available on the PoliticalMashup website.
Slow-cooked data and APIs in the world of Big Data: the view from a city per... (Oscar Corcho)
The document discusses slow-cooked data and APIs from a city perspective. It draws an analogy between big data/fast food and slow-cooked/linked open data. It outlines six rules for slow-cooking data: 1) appropriately segment datasets, 2) annotate data with semantics, 3) provide multiple data formats, 4) engage children in data contribution and use, 5) use open data internally before publishing, and 6) leverage common data structures for interoperability like fast food franchises do. The goal is to cook open data in a way that is both useful and reusable.
This paper discusses mathematical logic, explaining the notions of logic, statements, open sentences, and negation; logical operations such as conjunction, disjunction, implication, and biimplication; and tautologies, contradictions, and contingencies.
Keynote Exploring and Exploiting Official Publications (maartenmarx)
This document discusses requirements and opportunities for opening up official documents like parliamentary proceedings. It argues that the value lies not in individual documents but in the relationships between documents over time. A political n-gram viewer application is proposed that would allow exploration of topics and language used by different political parties over decades. However, linking documents and extracting needed metadata like speaker affiliations is challenging and existing linked open data is not reliable enough. Official documents need to be self-describing and use shared standards and controlled vocabularies to be truly open and interoperable.
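The counting behind such a political n-gram viewer can be sketched in a few lines: relative frequency of a term per party per year. The toy speech records below are invented for illustration; a real system would draw on parsed parliamentary proceedings with speaker affiliations.

```python
# Sketch of the core of a political n-gram viewer: relative frequency
# of a term per (party, year). The records are hypothetical examples.
from collections import Counter, defaultdict

speeches = [
    {"party": "PvdA", "year": 1970, "text": "inflation housing inflation"},
    {"party": "VVD",  "year": 1970, "text": "taxes inflation"},
    {"party": "PvdA", "year": 1971, "text": "housing housing"},
]

def ngram_counts(records):
    """Count unigrams per (party, year)."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[(rec["party"], rec["year"])].update(rec["text"].split())
    return counts

def rel_freq(counts, party, year, term):
    """Share of tokens in that party-year that equal the term."""
    c = counts[(party, year)]
    total = sum(c.values())
    return c[term] / total if total else 0.0

counts = ngram_counts(speeches)
print(rel_freq(counts, "PvdA", 1970, "inflation"))  # 2 of 3 tokens
```

Plotting this quantity per party over decades gives the longitudinal view the keynote argues for.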
Building the PoliMedia search system; data- and user-driven (MaxKemman)
Presentation at the eHumanities group at the Meertens Institute (Amsterdam) on Thursday 18 April 2013.
Analysing media coverage across several types of media-outlets is a challenging task for (media) historians. A specific example of media coverage research investigates the coverage of political debates and how the representation of topics and people change over time. The PoliMedia project (http://www.polimedia.nl) aims to showcase the potential of cross-media analysis for research in the humanities, by 1) curating automatically detected semantic links between four data sets of different media types, and 2) developing a demonstrator application that allows researchers to deploy such an interlinked collection for quantitative and qualitative analysis of media coverage of debates in the Dutch parliament.
These two goals reflect the two perspectives on the development of a search system such as PoliMedia; data- and user-driven. In this presentation, Laura Hollink (VU) will present the data-driven perspective of linking between different datasets and the research questions that arise in achieving this linkage: how to combine different types of datasets and what kind of research questions are made possible by the data? Max Kemman (EUR) will present the user-driven perspective: which benefits can scholars have from linking of these datasets? What are the user requirements for the PoliMedia search system and how was the system evaluated with scholars in an eye tracking study?
1) The document discusses using open datasets for research purposes. It describes several open datasets including PoliMedia, which covers Dutch parliamentary debates, and Talk of Europe, which covers debates in the European Parliament.
2) Some challenges discussed include finding datasets that match research questions and determining what makes a dataset truly open. Collaboration with computer scientists may be needed.
3) The goals of using open datasets are described as both answering existing research questions and finding new research questions. Examples of analyses that could be done using the described datasets are provided.
Bringing parliamentary debates to the Semantic Web (Laura Hollink)
Presentation of the paper 'Bringing parliamentary debates to the Semantic Web' by Damir Juric, Laura Hollink and Geert-Jan Houben at the workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE2012) in conjunction with the 11th International Semantic Web Conference 2012 in Boston, USA.
See also the homepage of the PoliMedia project: http://polimedia.nl/
The document discusses the Sense4us toolkit which aims to help policymakers make more informed decisions by analyzing social media, open data sources, and modeling policy problems. It describes the different components of the Sense4us toolkit, including tools for topic analysis of social media, sentiment analysis, cognitive mapping of policy issues, and simulation of policy options. The document also discusses challenges in using social media and open data to inform policymaking and demonstrates how Sense4us addresses these challenges through various case studies and examples.
Using Topic Modeling to Study Everyday "Civic Talk" and Proto-political Engag... (Tuukka Ylä-Anttila)
We present a two-step topic modeling method of analysing political articulations in everyday proto-political "civic talk" on online social media and interpreting them in terms of cultural and political sociology.
Big data as a source for official statistics (Edwin de Jonge)
This document discusses using big data as a source for official statistics and outlines some key challenges:
1. Big data is often noisy, dirty, and unstructured, requiring methods to extract useful information and reduce noise. Visualization tools help explore large datasets.
2. Big data sources are selective and contain events rather than full population coverage, requiring methods to convert events to units and correct for selectivity.
3. Beyond simple correlation, additional analysis is needed to establish causality between big data findings and other data sources.
4. Privacy and security laws must be followed, requiring anonymization of sensitive microdata or use of aggregates within a secure environment. Addressing these methodological and legal challenges will help realize the potential of big data for official statistics.
This document discusses using big data as a source for official statistics. It provides an overview of big data research at Statistics Netherlands and why visualization is used as an analysis tool. Some key challenges discussed include dealing with noisy and dirty data, addressing selectivity issues in big data sources, going beyond simple correlation, and addressing privacy and security concerns. Examples are provided of visualizing census and social security register data. The future potential of big data for statistics is acknowledged, though fundamental methodological, legal and technical issues still need resolution.
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin... (Miriam Fernandez)
This document analyzes social media data related to policy topics collected from Twitter over a one week period. It finds that a small number of users, mainly news agencies and organizations, contribute the majority of content. The average user discussing policy on Twitter is more active than typical users. Discussion is geographically concentrated in regions with high population densities. A few topics, like privacy and minimum wage, received extensive discussion, while most topics were underrepresented. Sentiment analysis found that genetic engineering, immigration, and political donations received more negative sentiment, while privacy and fracking had more mixed positive and negative sentiment. The study is limited to one platform, language, time period and has scarce geo-location data for tweets.
WeGov Analysis Tools to connect Policy Makers with Citizens Online (Timo Wandhoefer)
The document summarizes the WeGov project, which aims to connect policy makers with citizens online using analysis tools. The project involves partners from several European countries. It is developing a toolbox of social media analysis tools to help policy makers understand public opinions and engage citizens. The toolbox will allow searching social networks, analyzing discussions to identify topics and opinions, and modeling user behaviors. It is being tested with governments and is expected to be finalized in September 2012.
Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen (wkwsci-research)
Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University
The document discusses two types of voting advice systems: the Lipschits method from 1977-1998 and the Stemwijzer method on the web from 1998. It presents a demo of "Lipschits on the web" called VerkiezingsKijker, which allows direct access to party manifestos. The main challenge is bridging the semantic gap between user search terms and manifesto language; the system addresses this using hierarchical controlled vocabularies and document expansion techniques like term harvesting. The conclusion discusses making both systems complementary and continuing to standardize controlled vocabularies and data.
Adding value to NLP: a little semantics goes a long way (Diana Maynard)
This document discusses how natural language processing (NLP) and semantics can help address challenges in four domains: monitoring violations against journalists, disaster relief, scientometrics, and the "B-word." It provides examples of how NLP tools can extract and categorize information, link entities and events, geotag social media posts, and connect different data sources to provide a richer understanding of knowledge production. Semantic technologies like ontologies are presented as a way to coherently connect topics across document types and data sources.
The document discusses the library IT systems in Denmark. It describes the historical context of libraries having regional monopolies and using many different CMS systems. It then summarizes the key library IT systems: 1) The FBS library system which is web-based, modular, and owned by the libraries; 2) The IMS inventory management system which tracks materials locations and optimizes lending; 3) Websites, apps, and shared infrastructure developed using open source tools in a collaborative environment. Interfaces in the new Dokk1 library allow digital signage, wayfinding, and event information sharing.
The document provides an overview of data science, big data, data mining, and data mining techniques. It defines data science as a multi-disciplinary field that uses scientific methods to extract knowledge from structured and unstructured data. Big data is described as large, diverse datasets that are too large for traditional databases to handle. Common data mining tasks like prediction, classification, clustering and association rule mining are summarized. Finally, specific techniques like decision trees, k-means clustering, and association rule mining are overviewed.
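The k-means clustering technique named in that overview can be shown with a minimal sketch of Lloyd's algorithm on scalar data; the data points and starting centers are invented, and real applications would use a library implementation such as scikit-learn.

```python
# Minimal 1-D k-means (Lloyd's algorithm) sketch. Data and initial
# centers are hypothetical; illustrative only.

def kmeans_1d(points, centers, iterations=10):
    """Alternate assignment and update steps on scalar data."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
print(centers)  # two centers, near 1.0 and 9.5
```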
ACMI and RMIT collaborated on a data visualization project using ACMI's collection of 40,000 moving image records. RMIT students created 56 infographics revealing patterns and correlations in the data. This exposed new marketing opportunities for ACMI and identified gaps to target for future acquisition. The project was concluded to be a success in making public knowledge more accessible and illuminating the richness of ACMI's collection.
The document outlines an assessment strategy for a course, including assignments such as an individual essay, group panel discussion, and personal research project that involves a literature review and proposal. Deadlines are provided for submitting assignments between November 2012 and September 2013. Guidance is given on topics and resources for the literature review portion of the personal research project.
Introduction to Research project PoliMedia (Martijn Kleppe)
Presentation about our research project 'PoliMedia - Interlinking multimedia for the analysis of media coverage of political debates'. Presented at the PoliMedia symposium, 23 January 2013, Amsterdam, the Netherlands
Beyond document retrieval using semantic annotations (Roi Blanco)
Traditional information retrieval approaches deal with retrieving full-text documents as a response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and have been adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information over particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Integers and rationals can be expressed in first-order logic using only the less than relation <. The difference between integers and rationals can also be expressed in first-order logic. Specifically, integers have the property that between any two integers, there is no other integer, while this is not true for rationals - between any two rationals there is always another rational.
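The density property that separates the rationals from the integers can be written as a single first-order sentence over the vocabulary {<}:

```latex
% Density: between any two elements there is a third.
% True in (Q, <): take z = (x + y)/2.
% False in (Z, <): no integer lies strictly between 0 and 1.
\forall x \, \forall y \, \bigl( x < y \rightarrow \exists z \, ( x < z \wedge z < y ) \bigr)
```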
This document discusses principles for open government data and provides examples of how parliamentary proceedings in the Netherlands have been published according to these principles. It outlines principles from the W3C and Open Government Group for making data human and machine-readable, linked, standardized, and reusable. Recent parliamentary data from 1995 onward is available online, while older historical proceedings from 1814-1995 can also be accessed. The document encourages assessing existing parliamentary websites against these principles and fully implementing open data standards to maximize the value and novel insights that can be gained from the data.
The document describes a two-stage approach to named entity recognition for Dutch text. In the first recognition stage, a classifier labels tokens as the beginning, inside, or outside of an entity based on features like surrounding words and capitalization. In the second classification stage, entity spans identified in the first stage are classified into types like person, location or organization using additional features about the entity text, context and capitalization patterns. The approach uses averaged perceptrons trained on custom feature sets at each stage to recognize and classify named entities in Dutch language documents.
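The two-stage pipeline described above can be sketched as follows. The simple capitalisation rule and toy gazetteer stand in for the trained averaged perceptrons and richer feature sets of the actual system; they are illustrative assumptions only, as is the example sentence.

```python
# Sketch of two-stage NER: stage 1 marks token spans with B/I/O tags,
# stage 2 assigns a type to each recognised span. Rules and gazetteer
# are hypothetical stand-ins for the trained classifiers.

def recognise(tokens):
    """Stage 1: B/I/O tagging from a crude capitalisation feature."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and i > 0:  # capitalised, non-sentence-initial
            prev = tags[-1]
            tags.append("I" if prev in ("B", "I") else "B")
        else:
            tags.append("O")
    return tags

def classify(span):
    """Stage 2: type a span using features of the entity text."""
    if span[-1] in ("Kamer", "Bibliotheek"):  # toy organisation gazetteer
        return "ORG"
    return "PER"

def ner(tokens):
    """Run both stages and return (span, type) pairs."""
    tags = recognise(tokens)
    entities, current = [], []
    for tok, tag in zip(tokens + [""], tags + ["O"]):  # sentinel flushes last span
        if tag in ("B", "I"):
            if tag == "B" and current:
                entities.append((current, classify(current)))
                current = []
            current.append(tok)
        elif current:
            entities.append((current, classify(current)))
            current = []
    return entities

print(ner("Vandaag sprak Maarten Marx in de Tweede Kamer".split()))
```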
Women in Dutch parliament: what they did (maartenmarx)
The document analyzes the role of women MPs in the Dutch Parliament from 1918 to 2012. It divides this period into four eras: Pioneers (1918-1948), Tokens (1948-1977), Defenders (1977-1994), and Players (1994-2012). It presents data on the percentage of speech acts by women MPs over time, normalized by the percentage of women MPs. It also examines the subject areas women MPs spoke about by looking at the government ministers present in debates. Tables show the topical inclination of women MPs in different eras, indicating their likelihood to speak on certain issues.
This document summarizes a presentation on analyzing political slant in Dutch public broadcasting. The researchers aimed to apply the methodology of Gentzkow and Shapiro (2010) to the Dutch situation. They collected subtitles for Dutch TV broadcasts and analyzed the relative frequencies of characteristic words for different political groups to estimate the probability a broadcast used language characteristic of each group. Preliminary results found talk shows and news broadcasts rarely used language characteristic of right-wing groups. The researchers plan to further develop this work into a bachelor's thesis and future collaboration with political scientists.
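The core of the Gentzkow and Shapiro (2010) approach is to score a text by the relative frequency of phrases characteristic of each political side. A minimal sketch, with entirely hypothetical marker phrases (real studies derive them statistically from parliamentary speech):

```python
# Hypothetical marker phrases; a real study would select these with,
# e.g., chi-squared scores over speeches by each political group.
LEFT_MARKERS  = {"sociale huur", "klimaatbeleid"}
RIGHT_MARKERS = {"lastenverlichting", "grenscontrole"}

def slant_score(text):
    """Fraction of marker hits that are right-wing markers (0 = left, 1 = right)."""
    text = text.lower()
    left  = sum(text.count(p) for p in LEFT_MARKERS)
    right = sum(text.count(p) for p in RIGHT_MARKERS)
    total = left + right
    return None if total == 0 else right / total

print(slant_score("meer grenscontrole en lastenverlichting"))  # → 1.0
```

Returning None when no markers occur mirrors the preliminary finding that many broadcasts use little politically characteristic language at all.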
1. PoliticalMashup 1
PoliticalMashup
Connecting promises and actions of politicians and how
society reacts to them
Maarten Marx
Universiteit van Amsterdam
Groningen, alfa-informatica, 2011-03-11
2. PoliticalMashup 2
Content
Overview PoliticalMashup project
Zooming in on one cultural heritage dataset
A few example applications
Research ideas for NLP scientists.
3. PoliticalMashup 3
Who am I?
Political scientist turned computer scientist
My field:
Theory of XML Database Systems
Semi Structured Information Retrieval
Cooperation with
Tweede Kamer
Koninklijke Bibliotheek,
historians at NIOD, DNPP
4. PoliticalMashup 4
PoliticalMashup project
Large scale data integration project
2 years NWO funded infrastructure project 2010-2012
Partners: U. Amsterdam, Groningen and Tilburg
Ongoing with irregular funding since 2008
5. PoliticalMashup 5
Goal of PoliticalMashup
Making huge amounts of textual data available for
large scale automatic quantitative data and content analysis
done by scientists from the humanities and social sciences.
6. PoliticalMashup 6
Mashup of what and how?
4 data sources
Promises and actions of politicians
Reactions to these in the media and among the general public
Connect data on
Political entities
Time
Topics
7. PoliticalMashup 7
Data sources
Promises
Election manifestos, mostly scans, DNPP
Party websites and blogs, Archipol
Twitter of politicians
Actions Parliamentary proceedings, mostly scans, KB
Reactions
News media
User generated content Fora, Blogs, Comments on news,
Twitter
8. PoliticalMashup 8
Used techniques
Text analytics and XML DB and IR technology
Named entity recognition and normalization
Data mining, Machine Learning, hand-crafted rules
Natural Language Processing, Language Models
Make implicit structure and information explicit.
16. PoliticalMashup 16
De Handelingen der Staten Generaal (Dutch
Hansards)
17. PoliticalMashup 17
About this collection
very sparse available metadata
very rich metadata sits hidden inside the raw data
Rich data model
Meeting (1 Day)
Topic
Stage direction
Scene
Stage direction
Speech
Paragraph
18. PoliticalMashup 18
Same data: different views
Raw data in PDF
XML styled with stylesheet
Machine readable XML format
20. PoliticalMashup 20
Content and structure search
Combine IR style keyword search with restrictions on structure.
E.g., return speeches by Wilders about Islam
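This kind of query can be sketched with standard XML tooling: a structural restriction (the speaker attribute) combined with a keyword restriction on the text. The fragment below is illustrative; element and attribute names are assumptions, not the actual PoliticalMashup schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of the proceedings data model.
xml = """
<topic title="Integratiebeleid">
  <speech speaker="Wilders"><p>... de islam ...</p></speech>
  <speech speaker="Pechtold"><p>... onderwijs ...</p></speech>
  <speech speaker="Wilders"><p>... de begroting ...</p></speech>
</topic>
"""

def speeches_about(root, speaker, keyword):
    """Structure restriction (speaker attribute) + IR-style keyword restriction."""
    hits = []
    for speech in root.findall(f".//speech[@speaker='{speaker}']"):
        text = "".join(speech.itertext())
        if keyword.lower() in text.lower():
            hits.append(text.strip())
    return hits

root = ET.fromstring(xml)
print(speeches_about(root, "Wilders", "islam"))   # → ['... de islam ...']
```

In production such queries would run in an XML database (e.g. via XQuery) rather than over in-memory trees, but the combination of structure and content constraints is the same.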
21. PoliticalMashup 21
Exhaustive data collection
Example query for NIOD historians
Search for paragraphs about fascisme OR nazisme OR dictatuur
OR (nazi AND dictatuur) OR . . .
Return a TSV file with, for each hit: date, speakername, speakerid,
speaker-party, . . .
NIOD query
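Producing such a TSV export is straightforward once hits carry structured metadata. A minimal sketch, assuming hypothetical field names and records:

```python
import csv, io

# Illustrative hit records as such a query might return (fields assumed).
hits = [
    {"date": "1947-06-12", "speakername": "Drees", "speakerid": "nl.m.01",
     "speaker-party": "PvdA", "text": "... dictatuur ..."},
]

def hits_to_tsv(hits, fields=("date", "speakername", "speakerid", "speaker-party")):
    """Serialize hit records to tab-separated values with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for hit in hits:
        writer.writerow(hit[f] for f in fields)
    return buf.getvalue()

print(hits_to_tsv(hits))
```

Using the csv module with a tab delimiter avoids subtle quoting bugs that hand-rolled string joining would introduce.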
22. PoliticalMashup 22
Link the proceedings to entities
Who is speaking?
Who says what to whom?
Applications
Summary of one speaker
On old OCRed data: Linking and resolving entities
23. PoliticalMashup 23
Application: Interruption graph (Attackogram)
MP A interrupts MP B iff A speaks during the speech block of B.
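Given that definition, the interruption graph falls out of a single pass over the speech sequence. A minimal sketch with an illustrative input format (each speech annotated with the owner of the block it occurs in):

```python
from collections import Counter

# Illustrative speech sequence: (block_owner, actual_speaker) per speech.
speeches = [
    ("Bos", "Bos"), ("Bos", "Wilders"), ("Bos", "Bos"),
    ("Wilders", "Wilders"), ("Wilders", "Bos"), ("Wilders", "Bos"),
]

def interruption_graph(speeches):
    """Weighted edge (A, B): how often MP A spoke during the block of MP B."""
    edges = Counter()
    for owner, speaker in speeches:
        if speaker != owner:
            edges[(speaker, owner)] += 1
    return edges

print(interruption_graph(speeches))
```

The resulting weighted directed graph is exactly what the "Attackogram" visualizes.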
25. PoliticalMashup 25
0) Topics
Common European thesaurus http://eurovoc.europa.eu
detection
classification (sentence, paragraph, speech level)
26. PoliticalMashup 26
1) Populist language in parliament
PhD Thesis Jan Jagers (2006).
27. PoliticalMashup 27
2) Automatically detecting promises (toezegging)
by ministers in Parliament
https://zoek.officielebekendmakingen.nl/kst-103196.pdf (page 56)
Eerste Kamer has a nice database online
http://www.eerstekamer.nl/toezeggingen_2
28. PoliticalMashup 28
Example
De voorzitter: Ik constateer dat wij bijna aan het einde van deze
vergadering zijn gekomen. Wij hebben nog tijd om even de
toezeggingen langs te lopen. Ik vraag iedereen om op te letten of er
niets over het hoofd is gezien. Ik zal dit snel doen en daarna spreken
wij nog even over het vervolg. De toezeggingen.
Na de zomer ligt het wetsvoorstel bij de Kamer.
Er komt een brief om de Kamer erover te informeren op welke wijze
er voorkomen wordt dat er expertise verloren gaat.
Minister Van Bijsterveldt-Vliegenthart: Dat heb ik niet
toegezegd. Beslist niet. Nee, dat doe ik niet, want ik heb dat niet
toegezegd.
(Translation: The chair: I note that we have almost reached the end of
this meeting. We still have time to quickly go through the promises. I
ask everyone to check whether anything has been overlooked. I will do
this quickly, and afterwards we will briefly discuss the follow-up. The
promises: After the summer, the bill will be before the House. A letter
will be sent to inform the House how the loss of expertise will be
prevented. Minister Van Bijsterveldt-Vliegenthart: I did not promise
that. Absolutely not. No, I will not do that, because I did not promise it.)
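A simple baseline for spotting such promises is matching surface cues in the text. This is a hypothetical sketch, not the project's detector; the cue patterns are illustrative and a real system would be learned from annotated data such as the Eerste Kamer database.

```python
import re

# Illustrative surface cues for promises (toezeggingen).
PROMISE_CUES = [
    r"\bik zeg toe\b",
    r"\ber komt een brief\b",
    r"\bna de zomer ligt het wetsvoorstel\b",
]
CUE_RE = re.compile("|".join(PROMISE_CUES), re.IGNORECASE)

def promise_sentences(text):
    """Return the sentences that contain at least one promise cue."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CUE_RE.search(s)]

text = ("Na de zomer ligt het wetsvoorstel bij de Kamer. "
        "Er komt een brief om de Kamer te informeren. "
        "Dat heb ik niet toegezegd.")
print(promise_sentences(text))   # the first two sentences match, the denial does not
```

As the example above shows, negation ("Dat heb ik niet toegezegd") is exactly what makes cue matching insufficient on its own.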
29. PoliticalMashup 29
3) Opinion detection
Detect opinions expressed about entities and topics. (Speaker is
known)
Detect reported speech.
30. PoliticalMashup 30
4) Detect type of speech
Interruption, attack, answer, speech (betoog), stage-direction,
...
http://data.politicalmashup.nl/debates/nl/
h-ek-19961997-37-58.1-tijdslijn.html
31. PoliticalMashup 31
5) Detect bullshit
Tautologieën . . .
Regels zijn regels ("rules are rules"), Op is op ("gone is gone")
Het is wat het is ("it is what it is")
32. PoliticalMashup 32
6) Spelling normalization
Dutch had many spelling reforms.
Leads to lower recall.
Search in new spelling, return results in old spellings.
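Searching in the new spelling while returning results in old spellings amounts to query expansion over a spelling-variant table. A minimal sketch, assuming a hand-built variant table (the entries here are real historical Dutch spellings, but the table itself is illustrative):

```python
# Illustrative table mapping modern Dutch spellings to historical variants.
OLD_SPELLINGS = {
    "mens": ["mensch"],   # pre-1947 spelling
    "zo":   ["zoo"],
    "gaan": ["gaen"],     # early-modern spelling
}

def expand_query(terms):
    """Map each modern query term to itself plus its known historical spellings."""
    return {t: [t] + OLD_SPELLINGS.get(t, []) for t in terms}

print(expand_query(["mens", "regering"]))
```

Expanding the query (rather than normalizing the corpus) keeps the historical documents intact while restoring recall.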
33. PoliticalMashup 33
Lots of data available: happy to share
Now: 15 years of Dutch Parliamentary Proceedings in rich XML
Now: 200 years more in poorer XML, slowly getting richer.
Parliamentary proceedings from EU (15y), UK (75y), Spain (40y),
Scandinavian countries, . . .
Election manifestos (provincial elections 2007 and 2011)
All tweets, blogs, Flickr and YouTube posts of all Dutch national
politicians for the past 1.5 years.