
XML::XParent
Another way to store XML elements...

Marco Masetti(grubert) - masetti@linux.it
grubert65@gmail.com

Ways of storing XML files
• Plain files, simple scripts to perform XPath
queries
– trivial, very limited scalability, search and element handling
• DBMS as BLOBs (text)
– Limited search features, performance and scalability. No
inherent element handling.
• DBMS with XML support
– Document oriented. Not supported by all. Different features
provided.
• Native XML databases (Tamino, Basex, eXist,...)
– OK… but then I'd need something else to talk about…
• Custom DBMS schemas
– Data oriented, element handling is trivial, scales very well

Custom DBMS schemas

• Structure mapping:
– the design of the database schema is based on the
understanding of XML Schema or DTDs

• Model mapping:
– A fixed database schema for all XML documents,
without the assistance of DTDs or XML Schemas

Structure-mapping schema: XML::RDB!
• Perl module that converts XML files into RDB schemas, then
populates and unpopulates them. You end up with one table
per XML element type.
• Pros:
  ● Does what it is meant to do
  ● Quite fast
  ● Works with XML Schemas too
  ● Could eventually treat value types properly
• Cons:
  ● Inherent hierarchical structure is lost
  ● Not good if the XML files belong to different schemas
  ● Does only what it is meant to do...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...

Model-mapping schema: XParent!

• XParent is a very simple DBMS schema that can be
used to store XML elements
• Does not require the XML schema (schema-oblivious)
• Highly normalized
• Cons:
  ● Values are stored as text

XParent: how it works...

Source XML:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
         xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
    <DescriptionUnit xsi:type="DescriptorCollectionType">
      <Descriptor size="5" xsi:type="DominantColorType">
        <ColorSpace type="HSV" colorReferenceFlag="false"/>
        <SpatialCoherency>0</SpatialCoherency>
        <Values>
          <Percentage>2</Percentage>
          <Index>10 6 0</Index>
        </Values>
        <Values>
        </Values>
        <Values>
        </Values>
      </Descriptor>
    </DescriptionUnit>
  </Mpeg7>

Table LabelPath
 id | len | path
----+-----+-------------------------------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
  2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
  3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1
   2 |      2 |       1
   3 |      3 |       2

Table Data
 did | pathid | ordinal | value
-----+--------+---------+-------
   2 |      2 |       1 | false
   3 |      3 |       2 | HSV

Table DataPath
 pid | cid
-----+-----
   1 |   2
   1 |   3
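To make the four tables concrete, here is a minimal sketch that rebuilds the rows shown above in an in-memory SQLite database and reads back the attributes of the ColorSpace element. It is written in Python purely for illustration and is not the module's code; the column types and the helper names `build_example_db` and `attributes_of` are assumptions, since the slides only show column names and sample rows.

```python
import sqlite3

# Column names come from the slide; the column types are assumptions.
SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
"""

BASE = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"


def build_example_db():
    """Create the four tables and insert the sample rows from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO LabelPath (id, len, path) VALUES (?, ?, ?)",
        [(1, 4, BASE),
         (2, 5, BASE + "/@colorReferenceFlag"),
         (3, 5, BASE + "/@type")],
    )
    conn.executemany(
        "INSERT INTO Element (did, pathid, ordinal) VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 2, 1), (3, 3, 2)],
    )
    conn.executemany(
        "INSERT INTO Data (did, pathid, ordinal, value) VALUES (?, ?, ?, ?)",
        [(2, 2, 1, "false"), (3, 3, 2, "HSV")],
    )
    conn.executemany(
        "INSERT INTO DataPath (pid, cid) VALUES (?, ?)",
        [(1, 2), (1, 3)],
    )
    conn.commit()
    return conn


def attributes_of(conn, did):
    """Return (path, value) for every child of node `did` that has a Data row."""
    cursor = conn.execute(
        """
        SELECT lp.path, d.value
        FROM DataPath dp
        JOIN Data d       ON d.did = dp.cid
        JOIN LabelPath lp ON lp.id = d.pathid
        WHERE dp.pid = ?
        ORDER BY d.ordinal
        """,
        (did,),
    )
    return cursor.fetchall()
```

Calling `attributes_of(build_example_db(), 1)` walks DataPath from the ColorSpace node (did 1) to its two attribute nodes and returns their label paths with the values stored in Data.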

The XML::XParent module
• Perl module to handle XML documents on an XParent
schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file ( xparent-parse.pl )
  ● Query the XParent schema ( xparent-search.pl )
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to
    be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the
    XParent schema
  ● XML::XParent::Elem: class that describes an XML
    element

XML::XParent::Schema drivers

• The XML::XParent::Schema class implements the
Driver/Interface pattern, so custom drivers can
be implemented for specific data stores
• 2 generic drivers implemented so far:
  – XML::XParent::Schema::DBIx: driver implementation based on
    DBIx::Class
    ● All advantages of an ORM (but who cares?)
    ● Quite slow!
  – XML::XParent::Schema::DBI: driver implementation
    based on DBI
    ● Direct integration with the data store
    ● Much faster...

The quest for speed...

• Tests performed on my laptop:
  ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
  ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
• Reference XML file:
  ● Size: 45 MB
  ● XML elements: ~600,000
• Reference DBMS: PostgreSQL 8.4.13
• Parsing of the reference file with the DBIx driver:
  ● perl xparent-parse.pl -i <ref.xml> -driver DBIx
  ● Execution time: > 3000 mins!!!
• Parsing of the reference file with the DBI driver:
  ● perl xparent-parse.pl -i <ref.xml> -driver DBI
  ● Execution time: ~400 mins.

...But then...

• I realized loading times were diverging!
• I realized there was a stupid error in the implementation of
the algorithm...

[Chart: exec time in mins (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28]

...But then...

• I realized that records in the Data and DataPath tables are not
referenced by anything else...
• They do not need to be inserted one at a time...
• => Bulk loading!!!
• ...given N elements, how many records do we have in the
DataPath table?

Bulk Loading
• Saves a lot of time storing data:
  DBI: bulk loading of 1000000 records
    All at once:     50.462398 wallclock seconds
    Chunks of 1000:  31.157044 wallclock seconds
    Chunks of 10000: 26.334099 wallclock seconds
• Distinct inserts of 1000000 records:
    Elapsed time: 250.563282 wallclock seconds

[Chart: exec time in mins (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16]
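The difference between distinct inserts and chunked bulk loading can be sketched as follows. This is an illustration in Python with SQLite, not the module's Perl/DBI code: `executemany` with one commit per chunk stands in for whatever bulk mechanism the driver actually uses, which the slides do not specify. The helper names (`new_db`, `chunks`, `load_one_by_one`, `load_in_chunks`) and the column types are assumptions.

```python
import sqlite3


def new_db():
    """Open an in-memory database with a DataPath table (types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DataPath (pid INTEGER, cid INTEGER)")
    return conn


def chunks(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load_one_by_one(conn, rows):
    """Distinct inserts: one statement and one commit per record."""
    for row in rows:
        conn.execute("INSERT INTO DataPath (pid, cid) VALUES (?, ?)", row)
        conn.commit()


def load_in_chunks(conn, rows, size=1000):
    """Bulk loading: buffer records and send them one chunk at a time."""
    for chunk in chunks(rows, size):
        conn.executemany("INSERT INTO DataPath (pid, cid) VALUES (?, ?)", chunk)
        conn.commit()
```

Both functions leave the table in the same state; the chunked version simply makes far fewer round trips and commits. This only works because, as noted above, nothing needs the generated row back before the next record is processed.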

...But then...
• For each element we have to check whether its path
already exists...
• Much better to cache it in a hash than to go back
and forth to the DB...

[Chart: exec time in mins (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12]
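The path cache can be sketched as a small lookup class: the database is consulted only the first time a path is seen, and every later element with the same path is resolved from memory. This is a Python illustration, not the module's implementation; the class name `PathCache`, the `db_lookups` counter and the rule that `len` equals the number of `/` separators (which matches the LabelPath rows shown earlier) are assumptions.

```python
import sqlite3


class PathCache:
    """Map label paths to LabelPath ids, querying the DB only on a miss."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}          # path -> id, the in-memory "hash"
        self.db_lookups = 0    # how many times we actually hit the DB

    def path_id(self, path):
        """Return the id for `path`, inserting a LabelPath row if needed."""
        if path in self.ids:
            return self.ids[path]

        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT id FROM LabelPath WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            # Assumption: len is the number of steps in the path.
            cursor = self.conn.execute(
                "INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                (path.count("/"), path),
            )
            path_id = cursor.lastrowid
        else:
            path_id = row[0]

        self.ids[path] = path_id
        return path_id
```

Since documents with ~600,000 elements typically contain far fewer distinct paths, almost every call is answered from the dictionary.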

...But then...
• Added some indexes:
• CREATE INDEX LabelPath_Path ON LabelPath (Path);
• CREATE INDEX Element_PathID ON Element (PathID);
• CREATE INDEX DataPath_Cid ON DataPath (Cid);
• CREATE INDEX DataPath_Pid ON DataPath (Pid);
• CREATE INDEX Data_Did ON Data (Did);

[Chart: exec time in mins (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12; + Indexes: 29 / 8]
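One way to check that an index is actually picked up is to compare the query plan before and after creating it. The sketch below does this on SQLite from Python, as an illustration only: the deck's measurements were taken on PostgreSQL, where `EXPLAIN` plays the same role, and the table definition and the helper `plan_for` are assumptions.

```python
import sqlite3

# The index statements are the ones listed on the slide.
INDEXES = [
    "CREATE INDEX LabelPath_Path ON LabelPath (Path)",
    "CREATE INDEX Element_PathID ON Element (PathID)",
    "CREATE INDEX DataPath_Cid ON DataPath (Cid)",
    "CREATE INDEX DataPath_Pid ON DataPath (Pid)",
    "CREATE INDEX Data_Did ON Data (Did)",
]

# Column names follow the earlier slides; the column types are assumptions.
SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
"""

PATH_LOOKUP = "SELECT id FROM LabelPath WHERE path = ?"


def plan_for(conn, sql, params=()):
    """Return the human-readable steps of SQLite's plan for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description


def plans_before_and_after():
    """Plan for the path lookup without indexes, then with them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    before = plan_for(conn, PATH_LOOKUP, ("/Mpeg7",))
    for statement in INDEXES:
        conn.execute(statement)
    after = plan_for(conn, PATH_LOOKUP, ("/Mpeg7",))
    return before, after
```

Before the indexes exist the lookup is a full scan of LabelPath; afterwards the plan names `LabelPath_Path`.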

...But then...
• Realized I could “compact” records...
  <?xml version="1.0" encoding="ISO-8859-1"?>
  <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
         xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
    <DescriptionUnit xsi:type="DescriptorCollectionType">
      <Descriptor size="5" xsi:type="DominantColorType">
        <ColorSpace type="HSV" colorReferenceFlag="false"/>
        <SpatialCoherency>0</SpatialCoherency>
        <Values>
        </Values>
        <Values>
        </Values>
        <Values>
        </Values>
      </Descriptor>
    </DescriptionUnit>
  </Mpeg7>

Saves another 20%-30%...
Needs some logic at query time (experimental)...

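To cut a very long story short...
Time (mins) to load ~600,000 XML elements

 Driver | Reference | Algo patched | Bulk loading | Cached paths | + Indexes | Compact
--------+-----------+--------------+--------------+--------------+-----------+---------
 DBIx   |    > 3000 |          177 |           98 |           41 |        29 |      22
 DBI    |      ~400 |           28 |           16 |           12 |         8 |       6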

• ...and we still have to do:
  ● Code profiling...
  ● Specific DBMS techniques...
  ● Use MapReduce to split jobs among several
    workers...

About retrieval...

• At first I tried implementing an XPath-to-SQL
translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you
want!
• XML::XParent provides an API (get_elem) to
query for the set of elements whose paths match
a given SQL regex. The API returns a set of
XML::XParent::Elem objects.
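The idea of selecting elements by a regex on their label path can be sketched directly in SQL. The example below is a Python/SQLite illustration and does not reproduce the signature or return type of `get_elem`, which the slides do not show. SQLite has no built-in `REGEXP` implementation, so one is registered from Python; `connect_with_regexp`, `find_elements` and the assumed table layout are hypothetical.

```python
import re
import sqlite3


def _regexp(pattern, value):
    """Back SQLite's REGEXP operator: `X REGEXP Y` calls regexp(Y, X)."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


def connect_with_regexp(database=":memory:"):
    """Open a connection on which `column REGEXP ?` is usable."""
    conn = sqlite3.connect(database)
    conn.create_function("REGEXP", 2, _regexp)
    return conn


def find_elements(conn, path_regex):
    """Return (did, path, value) for elements whose path matches the regex.

    `value` is None for elements that have no row in the Data table.
    """
    cursor = conn.execute(
        """
        SELECT e.did, lp.path, d.value
        FROM Element e
        JOIN LabelPath lp ON lp.id = e.pathid
        LEFT JOIN Data d  ON d.did = e.did
        WHERE lp.path REGEXP ?
        ORDER BY e.did
        """,
        (path_regex,),
    )
    return cursor.fetchall()
```

For instance, `find_elements(conn, r"/@\w+$")` would select every attribute node, while `find_elements(conn, r"ColorSpace$")` would select only the ColorSpace elements themselves. On PostgreSQL the same query can use the native `~` operator instead of a registered function.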

XML::XParent utilities: how to use them
• Configure parameters in the xparent.yml file:

    schema_params:
        - 'dbi:Pg:dbname=xparent'
    #   - 'dbi:SQLite:xparent.db'
        - grubert
        - grubert
        -
            AutoCommit: 1
    #plugins:
    #    'SLMS::Redis::ParserPlugin':
    #        'tag': 'MovingRegion'

• To load an XML file:
    perl xparent-parse.pl
        -i <input file>
        -driver <the Schema driver to use>
        [-config_file <the config file>]
        [-verbose]
        [-clean]
        [-compact]
• To query the XParent data store:
    perl xparent-search.pl
        -path <path regex>
• To clean the data store:
    perl xparent-clean.pl

Contribute!

https://github.com/grubert65/XParent-Perl.git
