  #16 (permalink)  
Old 11-16-2011, 04:06 PM
ccc31807
Guest
 
Posts: n/a
Default Re: How to import only part of a large XML file?

On Nov 16, 11:50 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
> AFAIK, a well-formed XML file could have an order description looking like
> this:
>
> <order
>
> >1</order
>
> >
>
> meaning, it is not really possible to parse XML without doing a
> character-by-character lexical analysis of the input data stream
> first.


As I said, it depends on the nature of your input. XML handles
'ragged' data as well as the kind of normalized data we would expect
to use for an RDBMS. If you aren't sure of the format of your data,
you obviously have to validate it somehow. Part of this might be
removing whitespace at the beginning and ends of lines. Sometimes it
might be removing newlines from several lines until you match some
kind of closing tag.

I don't advocate reinventing wheels. I also don't advocate searching
for a CPAN module as the first step in solving a particular
programming problem. If you need to run a script continually
processing the same kind of input, it might pay to cobble together
some code that does EXACTLY what you need, no more and no less, rather
than use someone else's code.

I say this as a promiscuous user of CPAN modules -- hardly a week goes
by that I don't install a new module for one reason or another -- and
frequently I just look at the source, modify it to do what I need, and
don't use or require the module.

TIMTOWTDI, CC.
  #17 (permalink)  
Old 11-16-2011, 04:17 PM
ccc31807
Guest
 
Posts: n/a
Default Re: How to import only part of a large XML file?

On Nov 11, 7:11 pm, Dwight Army of Champions
<dwightarmyofchampi...@hotmail.com> wrote:
> <?xml version="1.0"?>
> <library>
> <book>
>         <title>Dreamcatcher</title>
>         <author>Stephen King</author>
>         <genre>Horror</genre>
>         <pages>899</pages>
>         <price>23.99</price>
>         <rating>5</rating>
>         <publication_date>11/27/2001</publication_date>
> </book>
> ...
> </library>


If I had this kind of file, and it was a static file, I would read it
into some kind of database. If you used something like SQLite, you
could read it into a table <book> element by <book> element, and then
use normal SQL to munge your data.

Alternatively, you could convert the file into CSV format, which in many
ways is a lot easier to handle than XML.

It strikes me that using XML for this kind of work is overkill, unless
you had a specific requirement to use XML. If you had to use XML it
might pay to learn a little XSLT and use that instead of Perl. Perl is
a great language for string processing, but in some cases XSLT works
better.

CC.
  #18 (permalink)  
Old 11-16-2011, 05:31 PM
Klaus
Guest
 
Posts: n/a
Default Re: How to import only part of a large XML file?

On 16 nov, 18:17, ccc31807 <carte...@gmail.com> wrote:
> On Nov 11, 7:11 pm, Dwight Army of Champions
> <dwightarmyofchampi...@hotmail.com> wrote:
>
> > <?xml version="1.0"?>
> > <library>
> > <book>
> >         <title>Dreamcatcher</title>
> >         <author>Stephen King</author>
> >         <genre>Horror</genre>
> >         <pages>899</pages>
> >         <price>23.99</price>
> >         <rating>5</rating>
> >         <publication_date>11/27/2001</publication_date>
> > </book>
> > ...
> > </library>
>
> If I had this kind of file, and it was a static file, I would read it
> into some kind of database. If you used something like SQLite, you
> could read it into a table <book> element by <book> element, and then
> use normal SQL to munge your data.
>
> Alternatively, you could convert the file into CSV format, which in many
> ways is a lot easier to handle than XML.


Converting to CSV is as easy as:

use strict;
use warnings;

use XML::Reader;
use Text::CSV_XS;

# One root/branch block per record type. The loop below uses $rdr->rx
# to tell them apart (0 = book, anything else = music).
my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
    { root => '/library/book', branch => [
        '/title',
        '/author',
        '/genre',
        '/pages',
        '/price',
        '/rating',
        '/publication_date',
    ]},
    { root => '/library/music', branch => [
        '/title',
        '/artist',
        '/release_date',
        '/label',
    ]});

my $csv = Text::CSV_XS->new({ sep_char => ',', binary => 1, eol => $/ });
open my $ofh, '>', 'out.csv' or die $!;

# Write one CSV row per record, prefixed with its record type.
while ($rdr->iterate) {
    $csv->print($ofh, [ ($rdr->rx == 0 ? 'book' : 'music'), $rdr->value ]);
}

close $ofh or die $!;
  #19 (permalink)  
Old 11-16-2011, 06:53 PM
Klaus
Guest
 
Posts: n/a
Default Re: How to import only part of a large XML file?

On 16 nov, 17:32, ccc31807 <carte...@gmail.com> wrote:
> On Nov 11, 5:39 pm, Dwight Army of Champions
> <dwightarmyofchampi...@hotmail.com> wrote:
> > I have a very large XML file that I want to load, but I don't want to
> > necessarily load the entire document; that takes too long. What I want
> > to do instead is load only key/value pairs that meet certain criteria,
> > like only grab entries whose values fall within a certain date for a key
> > date_of_entry. Can I just use XML::Simple for this or do I need a
> > better module?
>
> This depends on the nature of your input. I do this kind of thing
> every day, and use a simple regular expression to filter the file. Of
> course, you still have to read every line of the file to make sure
> that you catch all of your intended targets, but you would have to do
> that anyway. This is the kind of task for which it's a lot easier to
> hand roll your own parser than it is to look for, evaluate, learn,
> install, and use some third party module. In my opinion anyway. For
> example:
>
> SCRIPT
> #! perl
> use warnings;
> use strict;
> my %filter;
> while (<DATA>)
> {
>     next unless /\w/;
>     chomp;
>     if ($_ =~ m!<order>(\d+)</order>!)
>     {
>         my $key = $1;
>         while (<DATA>)
>         {
>             last if $_ =~ m!</pres>!;
>             next unless $_ =~ m!<last>(\w+)</last>!;
>             $filter{$key} = $1;
>         }
>     }
> }
>
> print "Finished processing file\n";
> foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
> exit(0);


Using XML::Reader, it's even easier:

use strict;
use warnings;

use XML::Reader;

my %filter;

my $rdr = XML::Reader->new(\*DATA,
    {mode => 'branches'},
    { root => '/data/pres', branch => [
        '/order',
        '/last',
    ]});

while ($rdr->iterate) {
    my ($order, $last) = $rdr->value;
    $filter{$order} = $last;
}

print "Finished processing file\n";
foreach my $key (sort keys %filter) {
    print "$key => $filter{$key}\n";
}

__DATA__
<data>
<pres>
<order>1</order>
<first>George</first>
<last>Washington</last>
<year>1788</year>
</pres>
<pres>
<order>2</order>
<first>John</first>
<last>Adams</last>
<year>1796</year>
</pres>
<pres>
<order>3</order>
<first>Thomas</first>
<last>Jefferson</last>
<year>1800</year>
</pres>
</data>
  #20 (permalink)  
Old 11-17-2011, 06:33 PM
Rainer Weikusat
Guest
 
Posts: n/a
Default Re: How to import only part of a large XML file?

ccc31807 <cartercc@gmail.com> writes:
> On Nov 16, 11:50 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>> AFAIK, a well-formed XML file could have an order description looking like
>> this:
>>
>> <order
>>
>> >1</order
>>
>> >
>>
>> meaning, it is not really possible to parse XML without doing a
>> character-by-character lexical analysis of the input data stream
>> first.

>
> As I said, it depends on the nature of your input. XML handles
> 'ragged' data as well as the kind of normalized data we would expect
> to use for an RDBMS. If you aren't sure of the format of your data,
> you obviously have to validate it somehow. Part of this might be
> removing whitespace at the beginning and ends of lines. Sometimes it
> might be removing newlines from several lines until you match some
> kind of closing tag.


The point I was trying to make is that the kind of input your (example) code
can deal with needs to follow the rules of a grammar which is a proper
subset of the XML grammar.
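
To make the subset point concrete, here is a minimal sketch in core Perl. The two sample strings and the helper names orders_by_line and orders_tolerant are made up for illustration. The first helper mirrors the line-by-line match from the quoted script; the second only relaxes the whitespace rule, so it is still nowhere near a real XML parser (it knows nothing about attributes, comments, or CDATA).

```perl
use strict;
use warnings;

# The same <order> element written two ways; both forms are well-formed XML.
my $compact = "<order>1</order>\n";
my $spread  = "<order\n\n>1</order\n\n>\n";

# Line-by-line matching, in the style of the hand-rolled script.
sub orders_by_line {
    my ($xml) = @_;
    my @found;
    for my $line (split /\n/, $xml) {
        push @found, $1 if $line =~ m!<order>(\d+)</order>!;
    }
    return @found;
}

# Whole-string matching that tolerates whitespace before each closing '>'.
sub orders_tolerant {
    my ($xml) = @_;
    my @found = $xml =~ m!<order\s*>(\d+)</order\s*>!g;
    return @found;
}

for my $case ([compact => $compact], [spread => $spread]) {
    my ($name, $xml) = @$case;
    my @by_line  = orders_by_line($xml);
    my @tolerant = orders_tolerant($xml);
    printf "%-8s line-based: %d match(es), tolerant: %d match(es)\n",
        $name, scalar @by_line, scalar @tolerant;
}
```

The line-based helper finds the compact form and misses the spread one, which is exactly the sense in which it accepts only a subset of XML.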
  #21 (permalink)  
Old 11-17-2011, 07:06 PM
ccc31807
Guest
 
Posts: n/a
Default Re: How to import only part of a large XML file?

On Nov 17, 2:33 pm, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
> The point I was trying to make is that the kind of input your (example) code
> can deal with needs to follow the rules of a grammar which is a proper
> subset of the XML grammar.


Yes, I understood your point. We all have to deal with messy data, and
faulty input will kill an application with no hope of recovery if you
don't deal with the possibility of corrupted data.

That said, if you are confident of the format of your input (as you
might be with an input file generated from a database) it might be
quicker and easier to hand roll your own.
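
As a rough sketch of what hand rolling might look like for the original question (keep only entries whose date falls in a range), the following reads one <book> record at a time instead of loading the whole file. It assumes machine-generated input in which every <book> ends with a literal </book> and dates are written MM/DD/YYYY, as in the sample. The helper names sortable_date and titles_in_range, and the second book in the sample data, are invented for illustration.

```perl
use strict;
use warnings;

# Convert MM/DD/YYYY into a sortable YYYYMMDD string; undef on bad input.
sub sortable_date {
    my ($mdy) = @_;
    return undef unless defined $mdy && $mdy =~ m!^(\d{1,2})/(\d{1,2})/(\d{4})$!;
    return sprintf '%04d%02d%02d', $3, $1, $2;
}

# Read one <book> record at a time from $fh and return the titles whose
# publication_date lies within [$from, $to], both given as MM/DD/YYYY.
sub titles_in_range {
    my ($fh, $from, $to) = @_;
    my $lo = sortable_date($from);
    my $hi = sortable_date($to);
    die "bad date range\n" unless defined $lo && defined $hi;

    local $/ = '</book>';    # each read returns text up to the end of a book
    my @titles;
    while (my $record = <$fh>) {
        my ($title) = $record =~ m!<title>(.*?)</title>!s or next;
        my ($date)  = $record =~ m!<publication_date>(.*?)</publication_date>!s
            or next;
        my $when = sortable_date($date) or next;
        push @titles, $title if $when ge $lo && $when le $hi;
    }
    return @titles;
}

# Sample input: the first book is from the thread, the second is made up.
my $xml = <<'XML';
<?xml version="1.0"?>
<library>
<book>
        <title>Dreamcatcher</title>
        <publication_date>11/27/2001</publication_date>
</book>
<book>
        <title>Hypothetical Older Book</title>
        <publication_date>05/01/1999</publication_date>
</book>
</library>
XML

open my $fh, '<', \$xml or die "Cannot open in-memory handle: $!";
print "$_\n" for titles_in_range($fh, '01/01/2001', '12/31/2001');
close $fh;
```

Only one record is held in memory at a time, which is the point for a very large file. The same caveat as before applies: this accepts a narrow subset of XML and will silently skip records that deviate from it.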

If you have XML, you can use a SAX parser to process your input
element by element, and I assume that it would handle your whitespace
example without a problem.
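
A sketch of that idea, with one substitution stated up front: it uses the event handlers of XML::Parser (the expat bindings from CPAN, assumed to be installed) rather than a formal SAX implementation, since the callback style is similar. The helper name last_names_by_order and the sample data are made up for illustration; the first record uses the spread-out <order> form from earlier in the thread.

```perl
use strict;
use warnings;
use XML::Parser;    # CPAN module; assumed to be installed

# Build an order => last-name map from <pres> records using start,
# character-data, and end callbacks.
sub last_names_by_order {
    my ($xml) = @_;
    my (%filter, %current);
    my $text = '';

    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            $text    = '';                       # new element, new text buffer
            %current = () if $element eq 'pres'; # new record
        },
        Char => sub {
            my ($expat, $string) = @_;
            $text .= $string;                    # text may arrive in pieces
        },
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'order' || $element eq 'last') {
                $current{$element} = $text;
            }
            elsif ($element eq 'pres') {
                $filter{ $current{order} } = $current{last}
                    if defined $current{order} && defined $current{last};
            }
        },
    });

    $parser->parse($xml);
    return %filter;
}

my $xml = <<'XML';
<data>
<pres>
<order

>1</order

>
<first>George</first>
<last>Washington</last>
</pres>
<pres>
<order>2</order>
<first>John</first>
<last>Adams</last>
</pres>
</data>
XML

my %filter = last_names_by_order($xml);
print "$_ => $filter{$_}\n" for sort keys %filter;
```

Because the parser does the lexical analysis, the handlers see the same element names and text whether or not a tag is broken across lines.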

I don't deal with XML much, and I really appreciate the posts from
others that illustrate scripts with XML::Reader and the like. I didn't
have it but installed it yesterday, and have spent several hours
piddling with it.

CC.
