This document discusses techniques for extracting structured data from unstructured web pages, including microformats, RDFa, HTML5 Microdata, and the Any23 tool. Any23 is an open-source Java library and command-line tool that can extract RDF triples from various semantically marked up web documents like those using RDFa, microformats, etc. It allows distilling the semantic web "drop by drop" from ordinary web pages.
distilling the Web of Data drop by drop (with Java)
1. distilling the Web of Data drop by drop (with Java)
Sourcesense UK "Last Wednesday" - Davide Palmisano @dpalmisano
Wednesday, June 29, 2011
2. the shortest introduction ever to the Web of Data
Web page markup technologies are intended for human consumption
they let machines present raw data to humans
extracting valuable data may require fancy scraping techniques
scraping: one size doesn't fit all
3. the shortest introduction ever to the Web of Data
<div>
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div> The Rebel T2i EOS 550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div> price: 899 USD </div>
</div>
</div>
4. the shortest introduction ever to the Web of Data
<div>
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div> The Rebel T2i EOS 550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div> price: 899 USD </div>
</div>
</div>
what does this tag mean?
5. the shortest introduction ever to the Web of Data
<div>
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div> The Rebel T2i EOS 550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div> price: 899 USD </div>
</div>
</div>
what does this tag mean?
is this a currency or what?
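The scraping problem can be sketched in a few lines of Java. This is only an illustration, not code from the talk: the class name `NaivePriceScraper`, the regular expression and the second markup sample are all assumptions. The pattern is hand-tuned to the `price: 899 USD` layout shown on the slide, so it stops working as soon as another site lays out the same information differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scraper: it only knows one page layout.
final class NaivePriceScraper {

    // Tailored to "<div> price: <amount> <currency> </div>" and nothing else.
    private static final Pattern PRICE = Pattern.compile(
            "<div>\\s*price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})\\s*</div>");

    // Returns {amount, currency} when the page matches the expected layout.
    static Optional<String[]> scrape(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find()
                ? Optional.of(new String[] {m.group(1), m.group(2)})
                : Optional.empty();
    }

    public static void main(String[] args) {
        String slideMarkup = "<div> price: 899 USD </div>";
        // Invented markup from an imaginary second shop, same data, other layout.
        String otherShop = "<span class=\"cost\">USD 899</span>";

        System.out.println(scrape(slideMarkup)
                .map(p -> p[0] + " " + p[1]).orElse("no match")); // prints "899 USD"
        System.out.println(scrape(otherShop)
                .map(p -> p[0] + " " + p[1]).orElse("no match")); // prints "no match"
    }
}
```

Every new layout needs its own pattern, which is the "one size doesn't fit all" point made earlier in the deck.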
7. Microformats
"Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages"
Andy Mabbett
- community-driven initiative
- widely adopted
- quick & dirty
- scarcely extensible
8. Microformats
<div class="hlisting item">
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div class="description"> The Rebel T2i EOS
550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div class="price"> price: 899 USD </div>
</div>
</div>
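Once class names carry meaning, a consumer can look data up by name instead of guessing at layout. The following is a minimal sketch using only the JDK's DOM parser; ClassNameReader is a hypothetical helper, and it assumes well-formed XHTML input, which real pages often are not.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: reads hListing-style class names from well-formed XHTML. */
final class ClassNameReader {

    /** Maps each requested class name to the trimmed text of the first element carrying it. */
    static Map<String, String> read(String xhtml, String... classNames) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
        Map<String, String> found = new LinkedHashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            // The class attribute is a space-separated list, e.g. "hlisting item".
            for (String token : e.getAttribute("class").trim().split("\\s+")) {
                if (Arrays.asList(classNames).contains(token)) {
                    found.putIfAbsent(token, e.getTextContent().trim());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xhtml =
              "<div class=\"hlisting item\">"
            + "<div class=\"description\">top-of-the-line consumer digital SLR camera</div>"
            + "<div class=\"price\"> price: 899 USD </div>"
            + "</div>";
        System.out.println(read(xhtml, "description", "price"));
    }
}
```

Note that the value under "price" is still the free-text string "price: 899 USD": the class name says what the element is, but not which part is the amount and which the currency.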
9. RDFa: RDF in attributes
model your data as if they were Web pages
connected with named links and properties
and embed them in your (X)HTML using
@attributes
- RDF, graph-based model
- W3C Recommendation
- highly extensible
e.g. GoodRelations [1], a full-fledged
vocabulary for e-commerce
10. RDFa: RDF in attributes
model your data
[graph diagram]
http://mystore.com/product/5642
  ex:description -> "The Rebel T2i EOS 550D blah blah"
  ex:producer -> http://canon.co.uk
  ex:price -> (price node)
    ex:value -> 899
    ex:currency -> USD
11. RDFa: RDF in attributes
and then embed them in your
(X)HTML pages
<div about="http://mystore.com/product/5642">
<div>Canon Rebel T2i (EOS 550D) $899</div>
<div property="gr:description">The Rebel T2i EOS 550D
is Canon's blah blah</div>
<div rel="gr:hasPriceSpecification">
<span> price:
<span property="gr:hasCurrencyValue">899</span>
<span property="gr:hasCurrency">USD</span>
</span>
</div>
</div>
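To show how those attributes turn into triples, here is a deliberately tiny sketch. TinyRdfaWalker is a hypothetical class, not part of Any23 and not a conforming RDFa processor: it only understands @about, @rel without an explicit target, and @property with text content, it leaves prefixed names such as gr:description unexpanded, and it does not escape literals.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Illustrative only: handles just @about, @rel and @property, not full RDFa. */
final class TinyRdfaWalker {

    private final List<String> triples = new ArrayList<>();
    private int blankNodes = 0;

    /** Parses well-formed XHTML and returns one N-Triples-like line per statement found. */
    static List<String> triples(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            TinyRdfaWalker walker = new TinyRdfaWalker();
            walker.visit(root, null);
            return walker.triples;
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
    }

    private void visit(Element e, String subject) {
        if (e.hasAttribute("about")) {
            subject = "<" + e.getAttribute("about") + ">"; // new subject for this subtree
        }
        if (subject != null && e.hasAttribute("property")) {
            triples.add(subject + " " + e.getAttribute("property")
                    + " \"" + e.getTextContent().trim() + "\" .");
        }
        if (subject != null && e.hasAttribute("rel")) {
            // No explicit target: link to a fresh blank node, which then
            // becomes the subject for everything nested inside this element.
            String object = "_:b" + (blankNodes++);
            triples.add(subject + " " + e.getAttribute("rel") + " " + object + " .");
            subject = object;
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                visit((Element) n, subject);
            }
        }
    }

    public static void main(String[] args) {
        String xhtml =
              "<div about=\"http://mystore.com/product/5642\">"
            + "<div property=\"gr:description\">The Rebel T2i EOS 550D is Canon's blah blah</div>"
            + "<div rel=\"gr:hasPriceSpecification\"><span> price: "
            + "<span property=\"gr:hasCurrencyValue\">899</span> "
            + "<span property=\"gr:hasCurrency\">USD</span></span></div>"
            + "</div>";
        triples(xhtml).forEach(System.out::println);
    }
}
```

Unlike the microformat version, the amount and the currency come out as two separate statements attached to the price node, so a consumer no longer has to parse "899 USD" out of free text.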
12. HTML5: Microdata
Microdata allows nested groups of name-value
pairs to be added to HTML documents, in
parallel with the existing content
- W3C Working Draft
- native to the HTML5 specification
- serializable as RDF
- Google, Yahoo! and Bing endorsed Schema.org
- large adoption expected
14. % of marked up Web pages
[chart: percentage of Web pages marked up with RDFa, hCard, adr, xfn and hReview at 09/2008, 03/2009 and 10/2010; vertical axis from 0 to 3.5]
data from Yahoo! [2]
15. tie 'em all together
uniform, reconciled and
unified RDF representation
16. a drop-by-drop distiller
Anything To Triples (any23) is an open source,
Apache-licensed:
- Java library,
- Web service and
- command-line tool
able to distill RDF triples from a
variety of semantically marked up Web
documents
http://developers.any23.org
17. live demo http://any23.org
Web site with ~5000 product descriptions marked up with
GoodRelations using RDFa
18. use Any23 in your Java
programs
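Any23 runner = new Any23();
// Identify the client to the servers it fetches from.
runner.setHTTPUserAgent("test-user-agent");
HTTPClient httpClient = runner.getHTTPClient();
DocumentSource source = new HTTPDocumentSource(
    httpClient,
    "http://test.com/index.html"
);
ByteArrayOutputStream out = new ByteArrayOutputStream();
// Serialize every extracted triple as N-Triples into the buffer.
TripleHandler handler = new NTriplesWriter(out);
try {
    runner.extract(source, handler);
} finally {
    // Close the handler so that buffered output is flushed.
    handler.close();
}
String n3 = out.toString("UTF-8");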
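A quick sanity check on the resulting string is to count the statements in it. N-Triples puts one triple per line, so a minimal sketch only needs to skip blank lines and # comments. NTriplesCounter is a hypothetical helper in plain Java, not an Any23 class, and it does not validate the syntax of each line.

```java
/** Illustrative only: counts statements in an N-Triples string without validating them. */
final class NTriplesCounter {

    /** N-Triples is line-based, so every non-blank, non-comment line is one triple. */
    static int countTriples(String nTriples) {
        int count = 0;
        for (String line : nTriples.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String sample =
              "# extracted from http://test.com/index.html\n"
            + "<http://test.com/index.html> <http://purl.org/dc/terms/title> \"Test\" .\n"
            + "\n"
            + "<http://test.com/index.html> <http://purl.org/dc/terms/language> \"en\" .\n";
        System.out.println(countTriples(sample));
    }
}
```

In the program above, the argument would be the string held in n3.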
19. Any23: Command-Line tool
any23-core/bin$ ./any23
usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
       [-p] [-s] [-t] [-v] {<url>|<file>}
 -e <arg>            comma-separated list of extractors, e.g.
                     rdf-xml,rdf-turtle
 -f,--format <arg>   Output format [turtle (default),
                     ntriples, rdfxml, quad, uris]
 -l,--log <arg>      logging, please specify a file
 -n,--nesting        disable production of nesting triples
 -o,--output <arg>   ouput file (defaults to stdout)
 -p,--pedantic       validates and fixes HTML content
                     detecting commons issues
 -s,--stats          print out statistics of Any23
21. [architecture diagram; boxes in slide order]
Apache Tika: mimetype detection
CyberNeko HTML: DOM extraction
Validator: Rule, Fix
Extractors: Microdata Extractor, RDFa Extractor, Microformat Extractors (hListing, hReview, hCalendar, hCard)
Sesame
Writers: RDF/XML Writer, NQuads Writer, JSON Writer
ExtractionResult
22. extractor
public interface Extractor<Input> {

    /**
     * Executes the extractor. Will be invoked only once, extractors are
     * not reusable.
     *
     * @param in The extractor's input
     * @param documentURI The document's URI
     * @param out Sink for extracted data
     * @throws IOException On error while reading from the input stream
     * @throws ExtractionException On other error, such as parse errors
     */
    void run(Input in, URI documentURI, ExtractionResult out)
            throws IOException, ExtractionException;

    /**
     * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
     * this extractor.
     */
    ExtractorDescription getDescription();
}
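The shape of that contract (input, document URI, sink) can be tried out without the library. The sketch below uses simplified stand-ins: TripleSink and SketchExtractor are hypothetical types that only mirror the roles of ExtractionResult and Extractor, with plain strings instead of the real RDF model classes, and the title extractor is an invented example rather than one of Any23's own extractors.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: stand-in types that mimic the extractor contract, not Any23's API. */
final class ExtractorSketch {

    /** Stand-in for the sink role played by ExtractionResult (hypothetical). */
    interface TripleSink {
        void triple(String subject, String predicate, String object);
    }

    /** Stand-in mirroring the shape of Extractor<Input> (hypothetical). */
    interface SketchExtractor<I> {
        void run(I in, String documentUri, TripleSink out);
    }

    /** Hypothetical extractor: emits the text of the first <title> element as one triple. */
    static final class TitleSketchExtractor implements SketchExtractor<String> {
        @Override
        public void run(String html, String documentUri, TripleSink out) {
            int start = html.indexOf("<title>");
            int end = html.indexOf("</title>");
            if (start >= 0 && end > start) {
                String title = html.substring(start + "<title>".length(), end).trim();
                out.triple(documentUri, "http://purl.org/dc/terms/title", title);
            }
        }
    }

    /** Runs a fresh extractor on one document and collects what it writes to the sink. */
    static List<String> extractTitle(String html, String documentUri) {
        List<String> collected = new ArrayList<>();
        // One new instance per document, matching "extractors are not reusable".
        new TitleSketchExtractor().run(html, documentUri,
                (s, p, o) -> collected.add(s + " " + p + " " + o));
        return collected;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Canon Rebel T2i</title></head></html>";
        System.out.println(extractTitle(html, "http://test.com/index.html"));
    }
}
```

Passing the sink in, rather than returning a collection, lets the caller decide where the triples go, which is the same separation the pipeline draws between extractors and writers.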
25. roadmap
upcoming 0.6.0 release
- support for Microdata
- support for CSV
- support for RDFa 1.1 prefix mechanism
- improved app configuration
- bug fixes
Apache (pre) Incubation process
- http://wiki.apache.org/incubator/Any23Proposal
- supporters and mentors (thanks guys!)
Simone Tripodi (@stripodi)
Tommaso Teofili (@tteofili)
- we're looking for mentors
26. closing credits
active committers
Giovanni Tummarello ( @jccq )
Michele Mostarda ( @micmos )
Davide Palmisano ( @dpalmisano )
Richard Cyganiak ( @cygri )
thanks to the whole Semantic Web community,
especially those who tirelessly challenge us
with bug reports and feature requests