Presentation given as part of the Zotero Training Workshops, Fall 2012. Original authored in Pandoc markdown and available on github: https://github.com/adam3smith/zotero-workshops
2. Why Learn About Translators?
One-click import from the web is perhaps the key feature
that distinguishes Zotero
In this session we will write a screen-scraper translator
for Zotero
Best case: This will allow you to write translators for sites
you or your clients need
Minimal case: This will give you an understanding of how
translators work and what may be possible, even if you're not
going to write one yourself
3. Some Notes on Zotero Translators
Each Zotero translator is an individual file, written in
JavaScript
There are four types of translators: Web, Import, Search,
Export
Some web translators, like those for many libraries, call on an
import translator (e.g. MARC). We won't learn about those.
Other web translators scrape data from the page - that is
what we will do now
4. (I know last week was Halloween but:) This isn't going to
be scary!
You cannot break Zotero by fiddling with translators. You can
always reset them from the Advanced panel of the preferences
About 2 years ago, Erik Hetzner of the UC libraries developed
a framework for writing translators, so now you don't need
any JavaScript. Just XPaths and regular expressions. And
those are easy!
5. XPaths are directions on a website
XPaths are basically directions used to point to a part of a
webpage
A webpage is built up from a number of nested nodes
This is what a very simple webpage looks like:
<html>
<head>
<title>A Basic Webpage</title>
</head>
<body>
<div id="title">The Title of the webpage</div>
<div id="content" class="text">The Content of the webpage</div>
</body>
</html>
6. The most basic XPath
Give directions: at every corner/node, tell Zotero where to go.
Let's say we want to go to "The Content of the webpage"
Take the HTML road, take a left at <body>, then take
the <div> street, or as an XPath:
/html/body/div
7. Making XPaths more precise
But we're still lost: which of the two <div> streets do we
go down?
Option 1: Take the second <div>
/html/body/div[2]
Option 2: Take the <div> that has "content" as its id
/html/body/div[@id="content"]
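The three paths from slides 6 and 7 can be tried outside Zotero. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`, which is an assumption made for illustration only: Zotero evaluates XPaths in the browser, not in Python. ElementTree supports just a subset of XPath and searches relative to the root element, so `/html/body/div` is written as `./body/div` here. The sample page is reproduced with the second `<div>` assumed to read "The Content of the webpage".

```python
import xml.etree.ElementTree as ET

# The sample page from slide 5 (second <div> completed by assumption).
PAGE = """<html>
<head>
<title>A Basic Webpage</title>
</head>
<body>
<div id="title">The Title of the webpage</div>
<div id="content" class="text">The Content of the webpage</div>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/body/div : ambiguous, it matches both <div> elements
all_divs = root.findall("./body/div")

# /html/body/div[2] : the second <div> (XPath positions start at 1)
second_div = root.find("./body/div[2]")

# /html/body/div[@id="content"] : the <div> whose id is "content"
content_div = root.find("./body/div[@id='content']")

print(len(all_divs))     # how many nodes the imprecise path matches
print(second_div.text)   # text reached via the position predicate
print(content_div.text)  # text reached via the attribute predicate
```

Both precise paths land on the same node. The attribute version is usually the sturdier choice, since it keeps working if another `<div>` is later added above the content.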
8. Making XPaths more efficient
In an actual webpage, an XPath can be very long, so we'd like
to make it shorter. We can use // to start anywhere in the
HTML tree, e.g. to find the <div> with "content" as its id
anywhere on the page:
//div[@id="content"]
Sometimes we don't want to match the exact value of an attribute
like id. In those cases we can use contains(), as in
//div[contains(@id, "cont")]
We can combine conditions with "and" or "or" (in lowercase!)
//div[@id="content" and @class="text"]
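The shortcuts on this slide can be sketched the same way. This again assumes Python's standard-library `xml.etree.ElementTree` purely for illustration, and it comes with a caveat: ElementTree does not implement `contains()` or the `and` operator. The sketch therefore shows hand-written equivalents for those two cases (a plain Python filter, and two chained predicates) rather than the XPath expressions themselves. In a full XPath engine, such as the one in the browser, the expressions from the slide work as written.

```python
import xml.etree.ElementTree as ET

# The sample page from slide 5 (second <div> completed by assumption).
PAGE = """<html>
<head>
<title>A Basic Webpage</title>
</head>
<body>
<div id="title">The Title of the webpage</div>
<div id="content" class="text">The Content of the webpage</div>
</body>
</html>"""

root = ET.fromstring(PAGE)

# //div[@id="content"] : search at any depth (".//" in ElementTree)
anywhere = root.findall(".//div[@id='content']")

# //div[contains(@id, "cont")] : no contains() in ElementTree,
# so the substring test is done in Python instead
partial = [d for d in root.iter("div") if "cont" in d.get("id", "")]

# //div[@id="content" and @class="text"] : no "and" in ElementTree,
# but two chained attribute predicates select the same nodes
both = root.findall(".//div[@id='content'][@class='text']")

for label, nodes in [("//", anywhere), ("contains", partial), ("and", both)]:
    print(label, [d.text for d in nodes])
```

All three searches select only the content `<div>`; the title `<div>` is skipped because its id does not contain "cont".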
11. Our Tools
Scaffold - a Firefox extension to write and test the translator
Firefox Inspect Element - to help us understand the
structure of a webpage (there are alternatives like Firebug)