際際滷

際際滷Share a Scribd company logo
XML::XParent
Another way to store XML elements...

             Marco Masetti(grubert) - masetti@linux.it
                                     grubert65@gmail.com
Ways of storing XML files
 Plain files, simple scripts to perform XPath
  queries
  trivial, very limited scalability, search and element handling
 DBMS as BLOBs (text)
  Limited search features, performance and scalability. No
   inherent element handling.
 DBMS with XML support
  Document oriented. Not supported by all. Different features
   provided.
 Native XML databases (Tamino, Basex, eXist,...)
  Okbut then I need something else to talk of
 Custom DBMS schemas
  Data oriented, element handling trivial, scale very well
Custom DBMS schemas

 Structure mapping:
  the design of the database schema is based on the
   understanding of XML Schema or DTDs

 Model mapping:
  A fixed database schema for all XML documents
   without assistance of DTD or XML schemes
Structure-mapping schema: XML::RDB!
 Perl module to convert XML files into RDB schemas and
  populate, and unpopulate them. You end up with 1 table
  per each xml element type.
 Pros:
  
    Does what he means
  
    Quite fast
  
    Works with XML Schemas too
  
    Could eventually treat value types properly
 Cons:
  
    Inherent hierarchical structure lost
  
    Not good if XML files belongs to different schemas
  
    Does only what he means...
  
    Not very well maintained...
  
    SQL schemas can easily become unreadable...
Model-mapping schema: XParent !

 XParent is a very simple DBMS schema that can be
  used to store XML elements
 Does not require the XML schema (Schema-oblivious)
 Highly normalized
 Cons:
  
    Values are stored as text
XParent: how it works...
                     TableLabelPath
                     id|len|path
                     足足足足+足足足足足+足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足
                     1|4|/Mpeg7/DescriptionUnit/Descriptor/ColorSpace
                     2|5|/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
<?xmlversion="1.0"encoding="ISO足8859足1"?>
                     3|5|/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type
<Mpeg7xmlns="http://www.mpeg7.org/2001/MPEG足7_Schema"
xmlns:xsi="http://www.w3.org/2000/10/XMLSchema足instance">
<DescriptionUnitxsi:type="DescriptorCollectionType">
<Descriptorsize="5"xsi:type="DominantColorType">
                     TableElement
<ColorSpacetype="HSV"colorReferenceFlag="false"/>
                     did|pathid|ordinal
<SpatialCoherency>0</SpatialCoherency>
                     足足足足足+足足足足足足足足+足足足足足足足足足
<Values>     1|1|1
<Percentage>2</Percentage>
                     2|2|1
<Index>1060</Index>
                     3|3|2
</Values>
<Values>
<Percentage>15</Percentage>
                     TableData
<Index>6169</Index>
                     did|pathid|ordinal|value
</Values>
                     足足足足足+足足足足足足足足+足足足足足足足足足+足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足
<Values>
                     2|2|1|false
<Percentage>3</Percentage>
                     3|3|2|HSV
<Index>7184</Index>
</Values>
</Descriptor>
</DescriptionUnit>
</Mpeg7>             TableDataPath
                     pid|cid
                     足足足足足+足足足足足
                     1|2
                     1|3
The XML::XParent module
 Perl module to handle XML documents on a XParent
  schema
 Can load any XML file into the same SQL schema
 Plugins can be registered for custom logic on elements
 Provides utilities to:
  
    Create the XParent schema for SQLite and Postgresql
  
    Parse and load an XML file ( xparent-parse.pl )
  
    Query the XParent schema ( xparent-search.pl )
 Classes:
  
    XML::XParent::Parser: XML parser based on XML::Twig
  
    XML::XParent::Parser::Plugin: base interface class to
    be implemented by any plugin
  
    XML::XParent::Schema: base class (interface) to the
    XParent schema
  
    XML::XParent::Elem: class that describes an XML
    element
XML::XParent::Schema drivers

 The XML::XParent::Schema class implements the
  Driver/Interface pattern: in this way custom drivers can
  be implemented for specific data stores
 2 generic drivers implemented so far:
  
    XML::XParent::Schema::DBIx: driver implementation based on
    DBIx::Class
    
      All advantages of an ORM (but who cares ?)
     
         Quite slow!
 
     XML::XParent::Schema::DBI: driver implementation
     based on DBI
     
       Direct integration with the data store
     
       Much faster...
The quest for speed...


    Tests performed on my laptop:
    
        CPU0:Intel(R)Core(TM)i5CPUM540@2.53GHzstepping05
    
        CPU1:Intel(R)Core(TM)i5CPUM540@2.53GHzstepping05


    Reference XML file:
    
        Size: 45 MB
    
        XML elements: ~600.000

    Reference DBMS: PostgreSQL 8.4.13


    Parsing of the reference file with the DBIx driver:
    
        perlxparent足parse.pl足i<ref.xml>足足driverDBIx
    
      Execution time: > 3000 mins !!!

    Parsing of the reference file with the DBI driver:
    
        perlxparent足parse.pl足i<ref.xml>足足driverDBI
    
        Execution time: ~ 400 mins.
...But then...

  
      I realized loading times were divergent!

  
   I realized there was a stupid error in the implementation of
  the algorith...
Exec Time
(log t)
        4
                            3000
        3
                            400
                                                     177
        2


                                                     28
        1



                                               ...
                       m
                        .                  ed.
                    le                  ch
              Im
                   p
                                    pat
         f.                    go
      Re                     Al
...But then...


  I realized that records in Data and DataPath tables are not
referenced by anybody...

  They do not need to be inserted one each...

  => Bulk Loading!!!

  ...given N elements, how many records we have in the
DataPath table ?
Bulk Loading
                                                   Saves a lot of time storing data:
                                                  足足足DBI:Bulkloadingof1000000records足足足
                                                  Allinonce:50.462398wallclockseconds
                                                  Chunksof1000:31.157044wallclockseconds
                                                  Chunksof2000:27.747248wallclockseconds
                                                  Chunksof5000:28.209256wallclockseconds
Exec Time                                         Chunksof10000:26.334099wallclockseconds
(log t)
        4                                          Distinct inserts of 1000000 records:
                          3000
                                                            Elapsedtime:250.563282wallclockseconds
       3
                          400
                                                 177
       2                                                                   98
                                                 28
       1                                                                   16



                                           ...                       ...
                      .                 d.                        g.
                    em                he                       in
                  pl                tc                      ad
             Im                   pa                      Lo
        f.                   go                      lk
     Re                    Al                     Bu
...But then...
     For each element we have to check if path
      already exists...
     Much better cache it in an hash than go back
      and forth into the DB...
Exec Time
(log t)
       4
                           3000
       3
                           400
                                                    177
       2                                                                      98
                                                                                                             41
                                                    28
                                                                              16
       1                                                                                                     12


                                              ...                       ...                        ...
                                                                                                         .
                       .                   d.                        g.
                      m                   e
                                                                 di
                                                                   n                           t hs
                   le                  ch                                                    Pa
             Im
                  p
                                   pat                       L oa
        f.                    go                        lk                              ed
     Re                     Al                       Bu                               ch
                                                                                   Ca
...But then...
                                       Added some indexes:
                                         CREATEINDEXLabelPath_PathONLabelPath(Path);
                                         CREATEINDEXElement_PathIDONElement(PathID);
                                         CREATEINDEXDataPath_CidONDataPath(Cid);
                                         CREATEINDEXDataPath_PidONDataPath(Pid);
                                         CREATEINDEXData_DidONData(Did);
Exec Time
(log t)
       4
                              3000
       3
                              400
                                                     177
       2                                                                       98
                                                                                                                 41
                                                     28
                                                                               16                                                         29
       1                                                                                                         12
                                                                                                                                          8

                                                 .                       ...                                 .
                                              ...                     g.                                ..                          ...
                         m
                          .                 ed                      n                                s.                           s.
                      le                  h                       di                               th                          xe
                     p                  tc                      oa                               Pa
                 m                    pa                      L                              d                              de
        f   .I                   go                      lk                                he                             In
     Re                        Al                     Bu                            Ca
                                                                                         c                            +
...But then...
 Realized I could compact records...
                 <?xmlversion="1.0"encoding="ISO足8859足1"?>
                 <Mpeg7xmlns="http://www.mpeg7.org/2001/MPEG足7_Schema"
                 xmlns:xsi="http://www.w3.org/2000/10/XMLSchema足instance">
                 <DescriptionUnitxsi:type="DescriptorCollectionType">
                 <Descriptorsize="5"xsi:type="DominantColorType">
                 <ColorSpacetype="HSV"colorReferenceFlag="false"/>
                 <SpatialCoherency>0</SpatialCoherency>
                 <Values>
                 <Percentage>2</Percentage>
                 <Index>1060</Index>
                 </Values>
                 <Values>
                 <Percentage>15</Percentage>
                 <Index>6169</Index>
                 </Values>
                 <Values>
                 <Percentage>3</Percentage>
                 <Index>7184</Index>
                 </Values>
                 </Descriptor>
                 </DescriptionUnit>
                 </Mpeg7>



Saves another 20%-30%...
Needs some logic at query time (experimental)...
To cut a very long story short...
      Time (mins) to load ~600.000 XML elems
             Reference   Algo      Bulk      Cached   indexes   Compact
                         patched   loading   Paths

      DBIx   > 3000      177       98        41       29        22


      DBI    ~400        28        16        12       8         6





    ..and we have still to do:
    
      Code profiling...
    
      Specific DBMS techniques...
    
      Use MapReduce to split jobs among several
      workers...
About retrieval...

 At first I tried implementing an Xpath-to-sql
  translator
 Found it very very hard...
 ...and almost useless
 ...use the power of SQL to express what you
  want!
 XML::XParent provides an API (get_elem) to
  query for a set of elements whose paths match
  a given SQL regex. The API returns a set of
  XML::XParent::Elem objects.
XML::XParent utilities: how to use them
 Configure parameters into xparent.yml file:
                                  足足足
 To load an XML file:            schema_params:
perlxparent足parse.pl             足'dbi:Pg:dbname=xparent'
    足i<inputfile>               #足'dbi:SQLite:xparent.db'
    足足driver<theSchemadrivertouse>
                                  足grubert
                                  足grubert
    [足足config_file<theconfigfile>]
    [足足verbose]                   足
                                  AutoCommit:1
    [足足clean]                     #plugins:
    [足足compact]                   #'SLMS::Redis::ParserPlugin':
 To query the Xparent data store:#'tag':'MovingRegion'
perlxparent足search.pl
   足足path<pathregex>
   足足driver<theSchemadrivertouse>
   [足足config_file<theconfigfile>]
 To clean the data store:
perlxparent足clean.pl
   足足driver<theSchemadrivertouse>
   [足足config_file<theconfigfile>]
Contribute!

https://github.com/grubert65/XParent-Perl.git
Thank You !!!!

More Related Content

Xml::parent - Yet another way to store XML files

  • 1. XML::XParent Another way to store XML elements... Marco Masetti(grubert) - masetti@linux.it grubert65@gmail.com
  • 2. Ways of storing XML files Plain files, simple scripts to perform XPath queries trivial, very limited scalability, search and element handling DBMS as BLOBs (text) Limited search features, performance and scalability. No inherent element handling. DBMS with XML support Document oriented. Not supported by all. Different features provided. Native XML databases (Tamino, Basex, eXist,...) Okbut then I need something else to talk of Custom DBMS schemas Data oriented, element handling trivial, scale very well
  • 3. Custom DBMS schemas Structure mapping: the design of the database schema is based on the understanding of XML Schema or DTDs Model mapping: A fixed database schema for all XML documents without assistance of DTD or XML schemes
  • 4. Structure-mapping schema: XML::RDB! Perl module to convert XML files into RDB schemas and populate, and unpopulate them. You end up with 1 table per each xml element type. Pros: Does what he means Quite fast Works with XML Schemas too Could eventually treat value types properly Cons: Inherent hierarchical structure lost Not good if XML files belongs to different schemas Does only what he means... Not very well maintained... SQL schemas can easily become unreadable...
  • 5. Model-mapping schema: XParent ! XParent is a very simple DBMS schema that can be used to store XML elements Does not require the XML schema (Schema-oblivious) Highly normalized Cons: Values are stored as text
  • 6. XParent: how it works... TableLabelPath id|len|path 足足足足+足足足足足+足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足 1|4|/Mpeg7/DescriptionUnit/Descriptor/ColorSpace 2|5|/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag <?xmlversion="1.0"encoding="ISO足8859足1"?> 3|5|/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type <Mpeg7xmlns="http://www.mpeg7.org/2001/MPEG足7_Schema" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema足instance"> <DescriptionUnitxsi:type="DescriptorCollectionType"> <Descriptorsize="5"xsi:type="DominantColorType"> TableElement <ColorSpacetype="HSV"colorReferenceFlag="false"/> did|pathid|ordinal <SpatialCoherency>0</SpatialCoherency> 足足足足足+足足足足足足足足+足足足足足足足足足 <Values> 1|1|1 <Percentage>2</Percentage> 2|2|1 <Index>1060</Index> 3|3|2 </Values> <Values> <Percentage>15</Percentage> TableData <Index>6169</Index> did|pathid|ordinal|value </Values> 足足足足足+足足足足足足足足+足足足足足足足足足+足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足足 <Values> 2|2|1|false <Percentage>3</Percentage> 3|3|2|HSV <Index>7184</Index> </Values> </Descriptor> </DescriptionUnit> </Mpeg7> TableDataPath pid|cid 足足足足足+足足足足足 1|2 1|3
  • 7. The XML::XParent module Perl module to handle XML documents on a XParent schema Can load any XML file into the same SQL schema Plugins can be registered for custom logic on elements Provides utilities to: Create the XParent schema for SQLite and Postgresql Parse and load an XML file ( xparent-parse.pl ) Query the XParent schema ( xparent-search.pl ) Classes: XML::XParent::Parser: XML parser based on XML::Twig XML::XParent::Parser::Plugin: base interface class to be implemented by any plugin XML::XParent::Schema: base class (interface) to the XParent schema XML::XParent::Elem: class that describes an XML element
  • 8. XML::XParent::Schema drivers The XML::XParent::Schema class implements the Driver/Interface pattern: in this way custom drivers can be implemented for specific data stores 2 generic drivers implemented so far: XML::XParent::Schema::DBIx: driver implementation based on DBIx::Class All advantages of an ORM (but who cares ?) Quite slow! XML::XParent::Schema::DBI: driver implementation based on DBI Direct integration with the data store Much faster...
  • 9. The quest for speed... Tests performed on my laptop: CPU0:Intel(R)Core(TM)i5CPUM540@2.53GHzstepping05 CPU1:Intel(R)Core(TM)i5CPUM540@2.53GHzstepping05 Reference XML file: Size: 45 MB XML elements: ~600.000 Reference DBMS: PostgreSQL 8.4.13 Parsing of the reference file with the DBIx driver: perlxparent足parse.pl足i<ref.xml>足足driverDBIx Execution time: > 3000 mins !!! Parsing of the reference file with the DBI driver: perlxparent足parse.pl足i<ref.xml>足足driverDBI Execution time: ~ 400 mins.
  • 10. ...But then... I realized loading times were divergent! I realized there was a stupid error in the implementation of the algorith... Exec Time (log t) 4 3000 3 400 177 2 28 1 ... m . ed. le ch Im p pat f. go Re Al
  • 11. ...But then... I realized that records in Data and DataPath tables are not referenced by anybody... They do not need to be inserted one each... => Bulk Loading!!! ...given N elements, how many records we have in the DataPath table ?
  • 12. Bulk Loading Saves a lot of time storing data: 足足足DBI:Bulkloadingof1000000records足足足 Allinonce:50.462398wallclockseconds Chunksof1000:31.157044wallclockseconds Chunksof2000:27.747248wallclockseconds Chunksof5000:28.209256wallclockseconds Exec Time Chunksof10000:26.334099wallclockseconds (log t) 4 Distinct inserts of 1000000 records: 3000 Elapsedtime:250.563282wallclockseconds 3 400 177 2 98 28 1 16 ... ... . d. g. em he in pl tc ad Im pa Lo f. go lk Re Al Bu
  • 13. ...But then... For each element we have to check if path already exists... Much better cache it in an hash than go back and forth into the DB... Exec Time (log t) 4 3000 3 400 177 2 98 41 28 16 1 12 ... ... ... . . d. g. m e di n t hs le ch Pa Im p pat L oa f. go lk ed Re Al Bu ch Ca
  • 14. ...But then... Added some indexes: CREATEINDEXLabelPath_PathONLabelPath(Path); CREATEINDEXElement_PathIDONElement(PathID); CREATEINDEXDataPath_CidONDataPath(Cid); CREATEINDEXDataPath_PidONDataPath(Pid); CREATEINDEXData_DidONData(Did); Exec Time (log t) 4 3000 3 400 177 2 98 41 28 16 29 1 12 8 . ... . ... g. .. ... m . ed n s. s. le h di th xe p tc oa Pa m pa L d de f .I go lk he In Re Al Bu Ca c +
  • 15. ...But then... Realized I could compact records... <?xmlversion="1.0"encoding="ISO足8859足1"?> <Mpeg7xmlns="http://www.mpeg7.org/2001/MPEG足7_Schema" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema足instance"> <DescriptionUnitxsi:type="DescriptorCollectionType"> <Descriptorsize="5"xsi:type="DominantColorType"> <ColorSpacetype="HSV"colorReferenceFlag="false"/> <SpatialCoherency>0</SpatialCoherency> <Values> <Percentage>2</Percentage> <Index>1060</Index> </Values> <Values> <Percentage>15</Percentage> <Index>6169</Index> </Values> <Values> <Percentage>3</Percentage> <Index>7184</Index> </Values> </Descriptor> </DescriptionUnit> </Mpeg7> Saves another 20%-30%... Needs some logic at query time (experimental)...
  • 16. To cut a very long story short... Time (mins) to load ~600.000 XML elems Reference Algo Bulk Cached indexes Compact patched loading Paths DBIx > 3000 177 98 41 29 22 DBI ~400 28 16 12 8 6 ..and we have still to do: Code profiling... Specific DBMS techniques... Use MapReduce to split jobs among several workers...
  • 17. About retrieval... At first I tried implementing an Xpath-to-sql translator Found it very very hard... ...and almost useless ...use the power of SQL to express what you want! XML::XParent provides an API (get_elem) to query for a set of elements whose paths match a given SQL regex. The API returns a set of XML::XParent::Elem objects.
  • 18. XML::XParent utilities: how to use them Configure parameters into xparent.yml file: 足足足 To load an XML file: schema_params: perlxparent足parse.pl 足'dbi:Pg:dbname=xparent' 足i<inputfile> #足'dbi:SQLite:xparent.db' 足足driver<theSchemadrivertouse> 足grubert 足grubert [足足config_file<theconfigfile>] [足足verbose] 足 AutoCommit:1 [足足clean] #plugins: [足足compact] #'SLMS::Redis::ParserPlugin': To query the Xparent data store:#'tag':'MovingRegion' perlxparent足search.pl 足足path<pathregex> 足足driver<theSchemadrivertouse> [足足config_file<theconfigfile>] To clean the data store: perlxparent足clean.pl 足足driver<theSchemadrivertouse> [足足config_file<theconfigfile>]