XParent is a simple SQL schema to store XML elements. XML::XParent is a perl module that provides API to store XML files and retrieve XML elements from a XParent data store.
1 of 20
Download to read offline
More Related Content
Xml::parent - Yet another way to store XML files
1. XML::XParent
Another way to store XML elements...
Marco Masetti(grubert) - masetti@linux.it
grubert65@gmail.com
2. Ways of storing XML files
Plain files, simple scripts to perform XPath
queries
trivial, very limited scalability, search and element handling
DBMS as BLOBs (text)
Limited search features, performance and scalability. No
inherent element handling.
DBMS with XML support
Document oriented. Not supported by all. Different features
provided.
Native XML databases (Tamino, Basex, eXist,...)
Okbut then I need something else to talk of
Custom DBMS schemas
Data oriented, element handling trivial, scale very well
3. Custom DBMS schemas
Structure mapping:
the design of the database schema is based on the
understanding of XML Schema or DTDs
Model mapping:
A fixed database schema for all XML documents
without assistance of DTD or XML schemes
4. Structure-mapping schema: XML::RDB!
Perl module to convert XML files into RDB schemas and
populate, and unpopulate them. You end up with 1 table
per each xml element type.
Pros:
Does what he means
Quite fast
Works with XML Schemas too
Could eventually treat value types properly
Cons:
Inherent hierarchical structure lost
Not good if XML files belongs to different schemas
Does only what he means...
Not very well maintained...
SQL schemas can easily become unreadable...
5. Model-mapping schema: XParent !
XParent is a very simple DBMS schema that can be
used to store XML elements
Does not require the XML schema (Schema-oblivious)
Highly normalized
Cons:
Values are stored as text
7. The XML::XParent module
Perl module to handle XML documents on a XParent
schema
Can load any XML file into the same SQL schema
Plugins can be registered for custom logic on elements
Provides utilities to:
Create the XParent schema for SQLite and Postgresql
Parse and load an XML file ( xparent-parse.pl )
Query the XParent schema ( xparent-search.pl )
Classes:
XML::XParent::Parser: XML parser based on XML::Twig
XML::XParent::Parser::Plugin: base interface class to
be implemented by any plugin
XML::XParent::Schema: base class (interface) to the
XParent schema
XML::XParent::Elem: class that describes an XML
element
8. XML::XParent::Schema drivers
The XML::XParent::Schema class implements the
Driver/Interface pattern: in this way custom drivers can
be implemented for specific data stores
2 generic drivers implemented so far:
XML::XParent::Schema::DBIx: driver implementation based on
DBIx::Class
All advantages of an ORM (but who cares ?)
Quite slow!
XML::XParent::Schema::DBI: driver implementation
based on DBI
Direct integration with the data store
Much faster...
9. The quest for speed...
Tests performed on my laptop:
CPU0:Intel(R)Core(TM)i5CPUM540@2.53GHzstepping05
CPU1:Intel(R)Core(TM)i5CPUM540@2.53GHzstepping05
Reference XML file:
Size: 45 MB
XML elements: ~600.000
Reference DBMS: PostgreSQL 8.4.13
Parsing of the reference file with the DBIx driver:
perlxparent足parse.pl足i<ref.xml>足足driverDBIx
Execution time: > 3000 mins !!!
Parsing of the reference file with the DBI driver:
perlxparent足parse.pl足i<ref.xml>足足driverDBI
Execution time: ~ 400 mins.
10. ...But then...
I realized loading times were divergent!
I realized there was a stupid error in the implementation of
the algorith...
Exec Time
(log t)
4
3000
3
400
177
2
28
1
...
m
. ed.
le ch
Im
p
pat
f. go
Re Al
11. ...But then...
I realized that records in Data and DataPath tables are not
referenced by anybody...
They do not need to be inserted one each...
=> Bulk Loading!!!
...given N elements, how many records we have in the
DataPath table ?
12. Bulk Loading
Saves a lot of time storing data:
足足足DBI:Bulkloadingof1000000records足足足
Allinonce:50.462398wallclockseconds
Chunksof1000:31.157044wallclockseconds
Chunksof2000:27.747248wallclockseconds
Chunksof5000:28.209256wallclockseconds
Exec Time Chunksof10000:26.334099wallclockseconds
(log t)
4 Distinct inserts of 1000000 records:
3000
Elapsedtime:250.563282wallclockseconds
3
400
177
2 98
28
1 16
... ...
. d. g.
em he in
pl tc ad
Im pa Lo
f. go lk
Re Al Bu
13. ...But then...
For each element we have to check if path
already exists...
Much better cache it in an hash than go back
and forth into the DB...
Exec Time
(log t)
4
3000
3
400
177
2 98
41
28
16
1 12
... ... ...
.
. d. g.
m e
di
n t hs
le ch Pa
Im
p
pat L oa
f. go lk ed
Re Al Bu ch
Ca
14. ...But then...
Added some indexes:
CREATEINDEXLabelPath_PathONLabelPath(Path);
CREATEINDEXElement_PathIDONElement(PathID);
CREATEINDEXDataPath_CidONDataPath(Cid);
CREATEINDEXDataPath_PidONDataPath(Pid);
CREATEINDEXData_DidONData(Did);
Exec Time
(log t)
4
3000
3
400
177
2 98
41
28
16 29
1 12
8
. ... .
... g. .. ...
m
. ed n s. s.
le h di th xe
p tc oa Pa
m pa L d de
f .I go lk he In
Re Al Bu Ca
c +
15. ...But then...
Realized I could compact records...
<?xmlversion="1.0"encoding="ISO足8859足1"?>
<Mpeg7xmlns="http://www.mpeg7.org/2001/MPEG足7_Schema"
xmlns:xsi="http://www.w3.org/2000/10/XMLSchema足instance">
<DescriptionUnitxsi:type="DescriptorCollectionType">
<Descriptorsize="5"xsi:type="DominantColorType">
<ColorSpacetype="HSV"colorReferenceFlag="false"/>
<SpatialCoherency>0</SpatialCoherency>
<Values>
<Percentage>2</Percentage>
<Index>1060</Index>
</Values>
<Values>
<Percentage>15</Percentage>
<Index>6169</Index>
</Values>
<Values>
<Percentage>3</Percentage>
<Index>7184</Index>
</Values>
</Descriptor>
</DescriptionUnit>
</Mpeg7>
Saves another 20%-30%...
Needs some logic at query time (experimental)...
16. To cut a very long story short...
Time (mins) to load ~600.000 XML elems
Reference Algo Bulk Cached indexes Compact
patched loading Paths
DBIx > 3000 177 98 41 29 22
DBI ~400 28 16 12 8 6
..and we have still to do:
Code profiling...
Specific DBMS techniques...
Use MapReduce to split jobs among several
workers...
17. About retrieval...
At first I tried implementing an Xpath-to-sql
translator
Found it very very hard...
...and almost useless
...use the power of SQL to express what you
want!
XML::XParent provides an API (get_elem) to
query for a set of elements whose paths match
a given SQL regex. The API returns a set of
XML::XParent::Elem objects.
18. XML::XParent utilities: how to use them
Configure parameters into xparent.yml file:
足足足
To load an XML file: schema_params:
perlxparent足parse.pl 足'dbi:Pg:dbname=xparent'
足i<inputfile> #足'dbi:SQLite:xparent.db'
足足driver<theSchemadrivertouse>
足grubert
足grubert
[足足config_file<theconfigfile>]
[足足verbose] 足
AutoCommit:1
[足足clean] #plugins:
[足足compact] #'SLMS::Redis::ParserPlugin':
To query the Xparent data store:#'tag':'MovingRegion'
perlxparent足search.pl
足足path<pathregex>
足足driver<theSchemadrivertouse>
[足足config_file<theconfigfile>]
To clean the data store:
perlxparent足clean.pl
足足driver<theSchemadrivertouse>
[足足config_file<theconfigfile>]