Creating Zotero Translators Using Framework


               Sebastian Karcher



         Boston College, November 2012
Why Learn About Translators?



      One-click import from the web is perhaps the key feature
      that distinguishes Zotero
      In this session we will write a screen-scraper type translator
      for Zotero
      Best Case: This will allow you to write translators for sites
      you or your clients need
      Minimal Case: This will give you an understanding of how
      translators work and what may be possible, even if you're not
      going to do it yourself
Some Notes on Zotero Translators



      Each Zotero translator is an individual file, written in
      JavaScript
      There are four types of translators: Web, Import, Search,
      Export
      Some web translators, like those for many libraries, call on an
      import translator (e.g. MARC) - we won't learn about those.
      Other web translators scrape data from the page - that is
      what we will do now
(I know last week was Halloween but:) This isn't going to
be scary!



      You cannot break Zotero by fiddling with translators - you can
      always reset from the advanced panel of the preferences
      About 2 years ago, Eric Hetzner of the UC libraries developed
      a framework for writing translators, so now you don't need
      any JavaScript. Just XPaths and regular expressions. And
      those are easy!
XPaths are directions on a website

      XPaths are basically directions used to point to a part of a
      webpage
      A webpage is built up from a number of nested nodes
      This is what the simplest webpage looks like

   <html>
       <head>
           <title>A Basic Webpage</title>
       </head>
       <body>
           <div id="title">The Title of the webpage</div>
           <div id="content" class="text">The Content of the webpage</div>
       </body>
   </html>
The most basic XPath




      Give directions: at every corner/node, tell Zotero where to go:
      Let's say we want to go to "The Content of the webpage"
      Take the "html" road, take a left at "body", then take
      the "div" street, or in XPath:
      /html/body/div
Making XPaths more precise



      But we're still lost - which of the two "div" streets do we
      go down?
      Option 1: Take the second <div>
      /html/body/div[2]
      Option 2: Take the <div> that has "content" as an id
      /html/body/div[@id="content"]
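
To make the "directions" idea concrete, here is a toy sketch in plain JavaScript that walks the sample page one step at a time. It is not a real XPath engine and not part of Zotero or Framework: the `page` object and the `followPath` helper are hypothetical, and the helper only understands tag names, `[n]` and `[@attr="value"]`.

```javascript
// Toy model of the sample page: each node has a tag, attributes, children and text.
const page = {
  tag: "html", attrs: {}, children: [
    { tag: "head", attrs: {}, children: [
      { tag: "title", attrs: {}, children: [], text: "A Basic Webpage" },
    ] },
    { tag: "body", attrs: {}, children: [
      { tag: "div", attrs: { id: "title" }, children: [],
        text: "The Title of the webpage" },
      { tag: "div", attrs: { id: "content", class: "text" }, children: [],
        text: "The Content of the webpage" },
    ] },
  ],
};

// Hypothetical helper: follow an absolute path and return the text of every
// node it ends up at. Supports only tag, tag[n] and tag[@attr="value"].
function followPath(root, path) {
  const steps = path.split("/").filter(Boolean);
  // Start at a virtual document node whose only child is the root element.
  let current = [{ children: [root] }];
  for (const step of steps) {
    const m = /^([\w-]+)(?:\[(\d+)\]|\[@([\w-]+)="([^"]*)"\])?$/.exec(step);
    if (!m) throw new Error("Unsupported step: " + step);
    const [, tag, index, attr, value] = m;
    const next = [];
    for (const node of current) {
      // At each corner, keep only the children with the requested tag name.
      let matches = node.children.filter((child) => child.tag === tag);
      if (index !== undefined) {
        // Positions are 1-based and count among the same-named siblings.
        matches = matches.slice(Number(index) - 1, Number(index));
      } else if (attr !== undefined) {
        matches = matches.filter((child) => child.attrs[attr] === value);
      }
      next.push(...matches);
    }
    current = next;
  }
  return current.map((node) => node.text);
}

console.log(followPath(page, "/html/body/div"));                // both <div>s
console.log(followPath(page, "/html/body/div[2]"));             // second <div> only
console.log(followPath(page, '/html/body/div[@id="content"]')); // the "content" <div>
```

The unqualified path reaches both <div>s, which is exactly the "still lost" problem; either option narrows the result to the one node we want.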
Making XPaths more efficient


      In an actual webpage, an XPath can be very long, so we'd like
      to make it shorter. We can use // to start anywhere in the
      HTML tree, e.g. the <div> with "content" as an id
      anywhere on the page:
      //div[@id="content"]
      Sometimes we don't want the precise content of an attribute
      like id - in those cases we can use contains() as in
      //div[contains(@id, "cont")]
      We can combine conditions with "and" or "or" (in lowercase!)
      //div[@id="content" and @class="text"]
A sample Framework Translator - Single item

   /** Articles */
   FW.Scraper({
   itemType : "blogPost",
   detect : FW.Xpath('//h1[@class="article-title"]'),
   title : FW.Xpath('//h1[@class="article-title"]').text().tri
   attachments : {
      url : FW.Url(),title : "voxEU snapshot",type : "text/html
   },
   creators : FW.Xpath('//div[@class="author"]//span[@class="field-content"]/a').text().cleanAuthor("auth
   abstractNote : FW.Xpath('//div[@class="article-teaser"]').t
   date : FW.Xpath('//h1[@class="article-title"]/following-sib
   publicationTitle : "VoxEU.org"
   });
A sample Framework Translator - Multiples


   /** Search results */
   FW.MultiScraper({
   itemType : "multiple",
   detect : FW.Xpath('//ul[contains(@class, "search-results")]
   choices : {
   titles : FW.Xpath('//ul[contains(@class, "search-results")]/li/h2/a').text(),
   urls : FW.Xpath('//ul[contains(@class, "search-results")]/li/h2/a').key("href").text()
   }
   });
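
For a multiple-item page, the titles and urls lists have to line up: the n-th title is shown to the user for the n-th URL. The sketch below illustrates that pairing in plain JavaScript. It is not Framework code: `pairChoices` and the example.org data are hypothetical, and it assumes both XPaths return their results in the same order.

```javascript
// Hypothetical helper: build a { url: title } lookup from two parallel lists,
// the kind of object a "select items" dialog can be filled from.
function pairChoices(urls, titles) {
  if (urls.length !== titles.length) {
    // A mismatch usually means one XPath matches more nodes than the other.
    throw new Error("urls and titles differ in length");
  }
  const items = {};
  urls.forEach((url, i) => {
    items[url] = titles[i];
  });
  return items;
}

// Made-up search results, standing in for what the two XPaths would return.
const urls = ["https://example.org/article/1", "https://example.org/article/2"];
const titles = ["First article", "Second article"];

console.log(pairChoices(urls, titles));
```

This is why both choices in the sample use the same XPath up to the <a> element: taking the text and the href from the same nodes keeps the two lists aligned.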
Our Tools




      Scaffold - a Firefox extension to write and test the translator
      Firefox "Inspect Element" - to help us understand the
      structure of a webpage (there are alternatives like Firebug)
