XML processing with Perl
For the 2nd YPPUG session, by Joe Jiang [email_address]

XML is a data format, not a language

We use it in finance and search.
DMP also supports it, but not as well as text/HTML.
Many people use it for configuration files.
I have used it for translating a Perl book. For example:

    ...
    $book/> count $book//sect1
    117
    $book/> count $book//sect2
    149
    $book/> count $book//para
    4691
    # Wow, it's a big book :)
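
The same counts can be taken from a plain Perl script. The following is a minimal sketch, assuming XML::LibXML is installed; the tiny inline document is a hypothetical stand-in for the real book, so the numbers differ from the ones above.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumed installed; xsh is built on top of it

# Hypothetical miniature DocBook document standing in for the real book.
my $xml = <<'XML';
<book>
  <chapter>
    <title>Introduction</title>
    <sect1><para>One.</para><sect2><para>Two.</para></sect2></sect1>
    <sect1><para>Three.</para></sect1>
  </chapter>
</book>
XML

my $book = XML::LibXML->load_xml(string => $xml);

# Evaluate the same kind of XPath count() expression as the xsh session.
my %count;
for my $name (qw(sect1 sect2 para)) {
    $count{$name} = $book->findvalue("count(//$name)");
    print "$name: $count{$name}\n";
}
```

For a real file, `load_xml(location => 'english-tidyup.xml')` would replace the inline string.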

The tool to work with XML

It is named XML::XSH2, by Petr Pajas, and it comes with a useful utility named xsh.
xsh is based on XML::LibXSLT and XML::SAX::Writer, and ...
which in turn are based on XML::LibXML and a lot of ...
So do not expect a simple installation :)
It can still be built with the cpanm utility, so I suggest installing cpanm first:

    $ curl -kL http://cpanmin.us | perl - --sudo App::cpanminus
    $ cpanm -S XML::XSH2
    # Already installed on dev, so there you can just run: xsh
    # ! Finding XML::XSH2 on cpanmetadb failed.
    # Messages like this are common

How is it used? XPath plus verbs

    $scratch/> $book := open english-tidyup.xml
    parsing english-tidyup.xml
    done.
    $book/> cd //book/chapter[1]
    $book/book/chapter[1]> ls title
    <title>Introduction</title>
    $book/book/chapter[1]> cd /
    $book/> ls //chapter/title
    <title>Introduction</title>
    <title>Filesystems</title>
    <title>User Accounts</title>
    ...
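
The "cd" and "ls" verbs map roughly onto picking a context node and evaluating an XPath expression against it. Here is a minimal sketch with XML::LibXML (assumed installed); the inline document is a hypothetical three-chapter stand-in for english-tidyup.xml.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumed installed

# Hypothetical stand-in for the book opened in the xsh session.
my $book = XML::LibXML->load_xml(string => <<'XML');
<book>
  <chapter><title>Introduction</title></chapter>
  <chapter><title>Filesystems</title></chapter>
  <chapter><title>User Accounts</title></chapter>
</book>
XML

# Roughly "cd //book/chapter[1]": choose a context node.
my ($chapter) = $book->findnodes('//book/chapter[1]');

# Roughly "ls title": a relative XPath evaluated from that context node.
my @first = map { $_->toString } $chapter->findnodes('title');

# Roughly "ls //chapter/title": an absolute XPath over the whole document.
my @titles = map { $_->textContent } $book->findnodes('//chapter/title');

print "$_\n" for @first, @titles;
```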

Good at pipeline processing

    $book/> ls //sect1//para/text() | wc -w
    Found 12398 node(s).
    150879

Use "wc -m" for a Chinese character count.
Or have fun with frequency statistics, e.g. the 100 most used words:

    $book/> ls //sect1//para/text() | perl -MList::MoreUtils=natatime -lane 'END{ $it = natatime 100, sort {$cnt{$b} <=> $cnt{$a}} keys %cnt; print for map {join qq(\t), $_, $cnt{$_}} $it->() } $cnt{$_}++ for @F'
    ...
    data    483
    ...
    Perl    437
    ...
    file    426
    ...
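
The one-liner is dense, so here is the same counting idea spelled out as a small script. It is a sketch in core Perl only (a plain array slice instead of natatime); the function name `top_words` and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Return the $n most frequent whitespace-separated words as [word, count]
# pairs, most frequent first; ties are broken alphabetically.
sub top_words {
    my ($n, @lines) = @_;
    my %cnt;
    for my $line (@lines) {
        $cnt{$_}++ for split ' ', $line;
    }
    my @sorted = sort { $cnt{$b} <=> $cnt{$a} || $a cmp $b } keys %cnt;
    $n = @sorted if $n > @sorted;    # do not slice past the end
    return map { [ $_, $cnt{$_} ] } @sorted[ 0 .. $n - 1 ];
}

# Hypothetical sample text in place of the book's paragraphs.
my @top = top_words(2, 'Perl data file', 'data Perl data');
print join("\t", @$_), "\n" for @top;
```

In a pipeline, the lines would come from standard input instead, for example `top_words(100, <STDIN>)`.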

It can be used for conversion #1

    $scratch/> $x:=open ArticleInfo_9.xml;
    parsing ArticleInfo_9.xml
    done.
    $x/> ls $x
    <?xml version="1.0" encoding="utf-16"?>
    <小样>
        <标题><![CDATA[第一推荐]]></标题>
        <作者><![CDATA[]]></作者>
        <内容><![CDATA[　　华为美国拓展求解 　　华为对美国市场的执着显示出中国公司走出去的急切需要，但这样高调注定要经受更多挫折。]]></内容>
        <附图>
            <简图>
                <文件名>../cnmlfiles/A01/A01Ab25C005_b.jpg</文件名>
                <高>260</高>
                <宽>245</宽>
            </简图>
        </附图>
    </小样>

The element names are Chinese: 小样 (sample), 标题 (title), 作者 (author), 内容 (content), 附图 (attached figure), 简图 (thumbnail), 文件名 (file name), 高 (height), 宽 (width). The title reads "Top recommendation"; the content is a short news item about Huawei's expansion in the US market.

Now build an empty XHTML document #2

    $x/> $y:=new html;
    $y/> ls $y
    <?xml version="1.0" encoding="utf-8"?>
    <html/>
    $y/> xadd element "<head/>" into $y/html;  # xadd is an alias of xinsert
    $y/> ls $y
    <?xml version="1.0" encoding="utf-8"?>
    <html>
      <head/>
    </html>
    $y/> xadd element "<title/>" into $y/html/head;
    $y/> xadd element "<body/>" into $y/html;
    $y/> ls $y
    <?xml version="1.0" encoding="utf-8"?>
    <html>
      <head>
        <title/>
      </head>
      <body/>
    </html>

Copy contents into the XHTML document #3

    $y/> xadd text $x//小样/标题/text() into $y/html/head/title;
    $y/> ls $y
    <?xml version="1.0" encoding="utf-8"?>
    <html>
      <head>
        <title>第一推荐</title>
      </head>
      <body/>
    </html>
    $y/> xadd text $x//小样/内容/text() into $y/html/body;
    $y/> save --file x.html $y;
    Document saved into file 'x.html'.
    $y/>Good bye!
    $ cat x.html
    <?xml version="1.0" encoding="utf-8"?>
    <html>
      <head>
        <title>第一推荐</title>
      </head>
      <body>　　华为美国拓展求解 　　华为对美国市场的执着显示出中国公司走出去的急切需要，但这样高调注定要经受更多挫折。</body>
    </html>
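
The three conversion steps can also be scripted directly against the DOM. This is a rough sketch with XML::LibXML (assumed installed), not what xsh does internally; the source document and its ASCII element names (`sample`, `title`, `content`) are hypothetical stand-ins for ArticleInfo_9.xml.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumed installed

# Hypothetical source document with ASCII stand-ins for the element names.
my $x = XML::LibXML->load_xml(string => <<'XML');
<sample>
  <title><![CDATA[First pick]]></title>
  <content><![CDATA[Body text of the article.]]></content>
</sample>
XML

# Step #2, roughly "$y := new html": a fresh document with an <html> root.
my $y    = XML::LibXML::Document->new('1.0', 'utf-8');
my $html = $y->createElement('html');
$y->setDocumentElement($html);

# Roughly the "xadd element ..." commands: append empty elements.
my $head  = $html->appendChild($y->createElement('head'));
my $title = $head->appendChild($y->createElement('title'));
my $body  = $html->appendChild($y->createElement('body'));

# Step #3, roughly the "xadd text ..." commands: copy text from the source.
$title->appendTextNode($x->findvalue('//sample/title'));
$body->appendTextNode($x->findvalue('//sample/content'));

# Pretty-print the result; format level 1 adds indentation.
print $y->toString(1);
```

Calling `$y->toFile('x.html', 1)` instead of printing would write the result to disk, similar to the `save --file` command.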

XSLT is a focused XML conversion language, based on XPath

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/perldata/hashref">
        <table border="1">
          <tr>
            <th>Key</th>
            <th>Value</th>
          </tr>
          <xsl:for-each select="item">
            <tr>
              <td><xsl:value-of select="@key"/></td>
              <td><xsl:value-of select="."/></td>
            </tr>
          </xsl:for-each>
        </table>
      </xsl:template>
    </xsl:stylesheet>

This works well with XML::Dumper

    $ perl -MXML::Dumper -e 'print pl2xml(\%INC)' | xsltproc hashref.xsl - | w3m -T text/html

We can use xsltproc to convert the DocBook book to HTML, and to PDF with another utility named fop.
Or generate an MS Word .doc file from OpenOffice, with the help of the OpenOffice DocBook XSLT filter.
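
The xsltproc step can also be run inside Perl. Below is a minimal sketch using XML::LibXSLT (assumed installed, along with XML::LibXML). The input is a hand-written approximation of what XML::Dumper emits for a hash reference, with made-up paths, and the stylesheet is the one from the previous slide.

```perl
use strict;
use warnings;
use XML::LibXML;     # assumed installed
use XML::LibXSLT;    # assumed installed

# Hand-written approximation of XML::Dumper output; paths are hypothetical.
my $source = XML::LibXML->load_xml(string => <<'XML');
<perldata>
  <hashref>
    <item key="strict.pm">/usr/lib/perl5/strict.pm</item>
    <item key="warnings.pm">/usr/lib/perl5/warnings.pm</item>
  </hashref>
</perldata>
XML

# The hashref-to-table stylesheet from the previous slide.
my $style = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/perldata/hashref">
    <table border="1">
      <tr><th>Key</th><th>Value</th></tr>
      <xsl:for-each select="item">
        <tr>
          <td><xsl:value-of select="@key"/></td>
          <td><xsl:value-of select="."/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
XSL

# Compile the stylesheet, apply it, and serialize the result to a string.
my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style);
my $result     = $stylesheet->transform($source);
my $html       = $stylesheet->output_as_chars($result);
print $html;
```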

Now you have been equipped with another tool named XML

Thanks all for the magic!

    Module Name    Author       Version
    XML::Dumper    MIKEWONG     0.81
    XML::Simple    GRANTM       2.18
    XML::LibXML    PAJAS        1.87
    XML::XPath     MSERGEANT    1.13
    XML::XSH2      PAJAS        2.1.3
    XML::Twig      MIROD        3.38
