Performance Comparison Between SAX XML::Filter::Dispatcher and XML::Twig
There is a good commentary on this article in Comments on "Performance Comparison Between SAX XML::Filter::Dispatcher and XML::Twig". That commentary includes cautions about some potential problems with my XML::Twig code.
This code is extracted from a larger application, and simplified. I intend to write XML processors for many different XML formats. I want to choose XML processing tools and techniques carefully. This is especially important since I will need to scale up this code and get others to work on it. I anticipate supporting at least fifty XML formats each containing dozens of unique tags. XML file sizes will vary from a few hundred bytes to perhaps hundreds of megabytes.
I looked first at XML::Twig, and then Matt Sergeant recommended that I consider XML::SAX. I tried it, but complained that the resulting code contains large case statements, which I don't like. Since I prefer a dispatch table approach, Matt suggested XML::Filter::Dispatcher. After I got some help with the Dispatcher documentation from Barrie Slaymaker, I was able to get this SAX-based approach to work.
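The difference between a case statement and a dispatch table can be sketched in plain Perl, without any XML module. This is only an illustration: the tag and attribute names are taken from the programs in this article, while `%handler_for`, `handle_event`, and the sample attribute values are hypothetical.

```perl
use strict;
use warnings;

# Dispatch table: element name => handler. Supporting a new tag means adding
# one entry here rather than another branch in an if/elsif chain.
my %handler_for = (
    UNITS => sub { my ($att) = @_; return "UNITS $att->{val}\n" },
    XRF   => sub { my ($att) = @_; return "XRF $att->{num} $att->{name}\n" },
    NET   => sub { my ($att) = @_; return "# NET '$att->{name}'\n" },
);

# Hypothetical stand-in for a parser callback: look the tag up, call its handler.
sub handle_event {
    my ($tag, $att) = @_;
    my $handler = $handler_for{$tag} or return '';   # unknown tags are skipped
    return $handler->($att);
}

my $out = '';
$out .= handle_event( UNITS => { val  => 'mil' } );
$out .= handle_event( NET   => { name => 'GND' } );
$out .= handle_event( BOGUS => {} );
print $out;
```

Both XML::Twig and XML::Filter::Dispatcher take this idea one step further by keying the table on a path expression instead of a bare tag name.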
I trimmed my program down and simplified it for the purposes of benchmarking. At first I thought that the XML::SAX approach was very slow. This was because XML::SAX can use a variety of back-end processors, and I had inadvertently allowed it to use the default back-end, which on my system is slow. I set $XML::SAX::ParserPackage in my program to select the fastest back-end, and the results became more interesting.
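A minimal sketch of checking and pinning the back-end, assuming XML::SAX is installed. XML::SAX::PurePerl is requested here only because it ships with XML::SAX, so the sketch runs anywhere; the SAX program in this article requests XML::LibXML::SAX instead.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# Show which SAX parsers are registered on this system.
print "registered: $_->{Name}\n" for @{ XML::SAX->parsers };

# Ask for a specific back-end. This must be set before the parser is created.
# (Assumption for this sketch: PurePerl, because it is always available.)
$XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

# The factory honours $XML::SAX::ParserPackage when it is set.
my $parser = XML::SAX::ParserFactory->parser;
print "using: ", ref($parser), "\n";
```

Printing the class of the parser actually in use is a cheap way to avoid benchmarking the wrong back-end by accident.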
The execution times of these two programs are within 2% of each other. This is a negligible difference that is close to the measurement uncertainty. These times were the fastest execution times for all of the experiments done. These experiments included changes in code layout, anonymous versus named subroutines, memory usage, and file IO.
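The article does not show the timing harness that was used. As one possible approach, an experiment such as anonymous versus named subroutines could be timed with the core Benchmark module. The handler bodies and the `%att` values below are hypothetical stand-ins modeled on the SEG handler.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical attribute values for one SEG element.
my %att = ( x => 10, y => 20, lay => 1, width => 5 );

# Named subroutine version of a handler body.
sub format_seg_named {
    my ($att) = @_;
    return "SEG $att->{x} $att->{y} $att->{lay} $att->{width}\n";
}

# Anonymous subroutine version, as it would be stored in a handler table.
my $format_seg_anon = sub {
    my ($att) = @_;
    return "SEG $att->{x} $att->{y} $att->{lay} $att->{width}\n";
};

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    named     => sub { format_seg_named(\%att) },
    anonymous => sub { $format_seg_anon->(\%att) },
} );
```

Differences of a few percent in the resulting table are within run-to-run noise, which matches the caution above about measurement uncertainty.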
The XML::Twig version uses much less memory. For my test dataset, XML::Filter::Dispatcher uses about 30M, and the fast version of XML::Twig uses about 12M. Turning on the $less_memory flag reduces memory to about 6.6M, at the expense of increasing the compute time by about 30%. Since I work with both speed-critical and memory-critical applications, I like this ability to trade off performance.
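The speed/memory switch can be reduced to a small sketch, assuming XML::Twig is installed. The inline XML is a made-up fragment that uses tag and attribute names from the test program, and the handler is keyed on the bare tag name to keep the sketch short.

```perl
use strict;
use warnings;
use XML::Twig;

my $less_memory = 1;    # set to 0 to favour speed over memory
my $out         = '';

my $twig = XML::Twig->new(
    twig_handlers => {
        SEG => sub {
            my ($t, $seg) = @_;
            $out .= 'SEG ' . $seg->att('x') . ' ' . $seg->att('y') . "\n";
            # purge discards the part of the tree that has been parsed so far
            $t->purge if $less_memory;
        },
    },
);

# Hypothetical sample input, not the article's traces.xml.
$twig->parse(<<'XML');
<TRACES><NET name="GND"><WIR numseg="2"><SEG x="1" y="2"/><SEG x="3" y="4"/></WIR></NET></TRACES>
XML

print $out;
```

With a document this small the flag makes no measurable difference; the trade-off only shows up on inputs the size of the test dataset.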
I learned an interesting performance optimization when writing the anonymous subs for XML::Twig. These subs should not uselessly return a long string. Processing this string can increase processing time by 50% in this example. This is why the start_tag_handlers return the value 1. In the twig_handlers, the if statement for the speed/memory tradeoff prevents the string from being returned.
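The effect comes from Perl returning the value of the last expression evaluated in a sub. A sketch in plain Perl, with a hypothetical one-million-character buffer standing in for the accumulated output:

```perl
use strict;
use warnings;

my $out = 'x' x 1_000_000;    # stand-in for a large accumulated output buffer

# The last expression is the append, so the return value of the sub is the
# whole of $out, handed back to the caller on every call.
my $returns_buffer = sub { $out .= "SEG 1 2 3 4\n" };

# Ending with a constant makes the return value a small scalar instead.
my $returns_one = sub { $out .= "SEG 1 2 3 4\n"; 1 };

my $from_first  = $returns_buffer->();
my $from_second = $returns_one->();

printf "first handler returned %d characters\n", length $from_first;
printf "second handler returned %s\n", $from_second;
```

The same reasoning applies to a handler that ends in a statement modifier such as `$_[0]->purge if $less_memory;`, since the last value evaluated is then the condition or the result of purge rather than the output string.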
Adding purge calls to the start_tag_handlers did not save memory, but still cost speed, so purge is not called for those handlers. In this application, all the handlers except the one for the HEADER tag could use start_tag_handlers. In that case, the memory usage is 12M whether purge is called or not. The desire to be able to use purge caused me to use twig_handlers whenever possible.
The start_tag_handlers were needed to preserve the document order in the output file. The SAX version of the program didn't require any special code to make the output come out in document order.
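The ordering issue arises because a twig handler fires when an element's closing tag is reached, so a parent's handler runs after those of its children. A small sketch, assuming XML::Twig is installed and using a made-up one-line document:

```perl
use strict;
use warnings;
use XML::Twig;

my (@start_order, @end_order);

my $twig = XML::Twig->new(
    # Fired when the opening tag is seen: attributes are available,
    # but the children have not been parsed yet.
    start_tag_handlers => {
        NET => sub { push @start_order, 'NET' },
        WIR => sub { push @start_order, 'WIR' },
    },
    # Fired when the element has been completely parsed.
    twig_handlers => {
        NET => sub { push @end_order, 'NET' },
        WIR => sub { push @end_order, 'WIR' },
    },
);

# Hypothetical sample input, not the article's traces.xml.
$twig->parse('<TRACES><NET name="GND"><WIR numseg="1"/></NET></TRACES>');

print "start_tag_handlers: @start_order\n";    # NET WIR
print "twig_handlers:      @end_order\n";      # WIR NET
```

This is why the NET and WIR lines, which must appear before the lines for the elements they contain, are written from start_tag_handlers in the XML::Twig program.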
    #!/usr/bin/perl
    #
    # Program to test performance of XML::Twig
    # Tom Anderson
    # Thu Jan 16 22:49:24 PST 2003
    #
    use strict;
    use warnings;
    use diagnostics;
    use XML::Twig;
    use File::Slurp;

    my $VERSION=0.02;
    my $less_memory=1;
    my $out="";

    my $xml= XML::Twig->new(
      start_tag_handlers => {
        '/TRACES/NET/WIR' => sub {
          $out .= 'WIR '.
            $_[1]->{att}->{numseg}  .' '.
            $_[1]->{att}->{startx}  .' '.
            $_[1]->{att}->{starty}  .' '.
            $_[1]->{att}->{termx}   .' '.
            $_[1]->{att}->{termy}   .' '.
            $_[1]->{att}->{optgroup}."\n";
          1;
        },
        '/TRACES/NET' => sub {
          $out .= "# NET '". $_[1]->{att}->{name}."'\n";
          1;
        },
      },
      twig_handlers => {
        '/TRACES/UNITS' => sub {
          $out .= 'UNITS '. $_[1]->{att}->{val}."\n";
          $_[0]->purge if $less_memory;
        },
        '/TRACES/STFIRST' => sub {
          $out .= 'ST '.
            $_[1]->{att}->{maxx}    .' '.
            $_[1]->{att}->{maxy}    .' '.
            $_[1]->{att}->{maxroute}.' '.
            $_[1]->{att}->{numconn} ."\n";
          $_[0]->purge if $less_memory;
        },
        '/TRACES/XRF' => sub {
          $out .= 'XRF '.
            $_[1]->{att}->{num} .' '.
            $_[1]->{att}->{name}."\n";
          $_[0]->purge if $less_memory;
        },
        '/TRACES/NET/WIR/SEG' => sub {
          $out .= 'SEG '.
            $_[1]->{att}->{x}    .' '.
            $_[1]->{att}->{y}    .' '.
            $_[1]->{att}->{lay}  .' '.
            $_[1]->{att}->{width}."\n";
          $_[0]->purge if $less_memory;
        },
        '/TRACES/NET/GUI' => sub {
          $out .= 'GUI '.
            $_[1]->{att}->{startx}  .' '.
            $_[1]->{att}->{starty}  .' '.
            $_[1]->{att}->{startlay}.' '.
            $_[1]->{att}->{termx}   .' '.
            $_[1]->{att}->{termy}   .' '.
            $_[1]->{att}->{termlay} .' '.
            $_[1]->{att}->{optgroup}."\n";
          $_[0]->purge if $less_memory;
        },
        '/TRACES/STLAST' => sub {
          $out .= 'ST '.
            $_[1]->{att}->{checkstat}  .' '.
            $_[1]->{att}->{numcomplete}.' '.
            $_[1]->{att}->{numinc}     .' '.
            $_[1]->{att}->{numunroute} .' '.
            $_[1]->{att}->{numnotrace} .' '.
            $_[1]->{att}->{numfill}    ."\n";
          $_[0]->purge if $less_memory;
        },
        '/TRACES/HEADER' => sub {
          $out .= $_[1]->text;
          $_[0]->purge if $less_memory;
        },
      },
    );

    $xml->parsefile('/perl/xml/traces.xml');
    write_file('twigex.out', $out);
    #!/usr/bin/perl
    #
    # Program to test performance of XML::SAX Parsers
    # Tom Anderson
    # Wed Jan 15 20:06:24 PST 2003
    #
    use strict;
    use warnings;
    use diagnostics;
    use XML::Filter::Dispatcher qw ( :all );
    use XML::SAX::Machines qw( Pipeline );
    use IO::File;

    my $VERSION=0.01;

    # Select the fast back-end instead of the system default.
    $XML::SAX::ParserPackage  = ""; # Suppress 'used only once' warning.
    $XML::SAX::ParserPackage .= "XML::LibXML::SAX(1.00)";

    my $out='';

    my $xml= XML::Filter::Dispatcher->new(
      Rules => {
        '/TRACES/HEADER' => [ 'string()' => sub { $out .= xvalue(); } ],
        '/TRACES/UNITS' => sub {
          $out .= 'UNITS '. $_[1]->{Attributes}->{"{}val"}->{Value}."\n";
          1
        },
        '/TRACES/STFIRST' => sub {
          $out .= 'ST '.
            $_[1]->{Attributes}->{"{}maxx"}->{Value}    .' '.
            $_[1]->{Attributes}->{"{}maxy"}->{Value}    .' '.
            $_[1]->{Attributes}->{"{}maxroute"}->{Value}.' '.
            $_[1]->{Attributes}->{"{}numconn"}->{Value} ."\n";
          1
        },
        '/TRACES/XRF' => sub {
          $out .= 'XRF '.
            $_[1]->{Attributes}->{"{}num"}->{Value} .' '.
            $_[1]->{Attributes}->{"{}name"}->{Value}."\n";
          1
        },
        '/TRACES/NET' => sub {
          $out .= "# NET '". $_[1]->{Attributes}->{"{}name"}->{Value}."'\n";
          1
        },
        '/TRACES/NET/WIR' => sub {
          $out .= 'WIR ' .
            $_[1]->{Attributes}->{"{}numseg"}->{Value}  .' '.
            $_[1]->{Attributes}->{"{}startx"}->{Value}  .' '.
            $_[1]->{Attributes}->{"{}starty"}->{Value}  .' '.
            $_[1]->{Attributes}->{"{}termx"}->{Value}   .' '.
            $_[1]->{Attributes}->{"{}termy"}->{Value}   .' '.
            $_[1]->{Attributes}->{"{}optgroup"}->{Value}."\n";
          1
        },
        '/TRACES/NET/WIR/SEG' => sub {
          $out .= 'SEG '.
            $_[1]->{Attributes}->{"{}x"}->{Value}     .' '.
            $_[1]->{Attributes}->{"{}y"}->{Value}     .' '.
            $_[1]->{Attributes}->{"{}lay"}->{Value}   .' '.
            $_[1]->{Attributes}->{"{}width"}->{Value} ."\n";
          1
        },
        '/TRACES/NET/GUI' => sub {
          $out .= 'GUI ' .
            $_[1]->{Attributes}->{"{}startx"}->{Value}  .' '.
            $_[1]->{Attributes}->{"{}starty"}->{Value}  .' '.
            $_[1]->{Attributes}->{"{}startlay"}->{Value}.' '.
            $_[1]->{Attributes}->{"{}termx"}->{Value}   .' '.
            $_[1]->{Attributes}->{"{}termy"}->{Value}   .' '.
            $_[1]->{Attributes}->{"{}termlay"}->{Value} .' '.
            $_[1]->{Attributes}->{"{}optgroup"}->{Value}."\n";
          1
        },
        '/TRACES/STLAST' => sub {
          $out .= 'ST ' .
            $_[1]->{Attributes}->{"{}checkstat"}->{Value}  .' '.
            $_[1]->{Attributes}->{"{}numcomplete"}->{Value}.' '.
            $_[1]->{Attributes}->{"{}numinc"}->{Value}     .' '.
            $_[1]->{Attributes}->{"{}numunroute"}->{Value} .' '.
            $_[1]->{Attributes}->{"{}numnotrace"}->{Value} .' '.
            $_[1]->{Attributes}->{"{}numfill"}->{Value}    ."\n";
          1
        },
      },
    );

    my $xml_fh = IO::File->new('/perl/xml/traces.xml');
    Pipeline($xml)->parse_file($xml_fh);
    $xml_fh->close;

    my $traces_fh= IO::File->new('>saxtraces.out');
    print $traces_fh $out;
    $traces_fh->close;
traces.xml input test file.
Thanks to Matt Sergeant, Barrie Slaymaker and Michel Rodriguez. They are all responsive to questions and supportive of Perl users. They prove why open-source software is better.