On Nov 16, 11:50 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
> AFAIK, a well-formed XML file could have an order description looking like
> this:
>
> <order
>
> >1</order
>
> meaning, it is not really possible to parse XML without doing a
> character-by-character lexical analysis of the input data stream
> first.

As I said, it depends on the nature of your input. XML handles
'ragged' data as well as the kind of normalized data we would expect
to use for an RDBMS. If you aren't sure of the format of your data,
you obviously have to validate it somehow. Part of this might be
removing whitespace at the beginning and ends of lines. Sometimes it
might be removing newlines from several lines until you match some
kind of closing tag.

I don't advocate reinventing wheels. I also don't advocate searching
for a CPAN module as the first step in solving a particular
programming problem. If you need to run a script continually
processing the same kind of input, it might pay to cobble together
some code that does EXACTLY what you need, no more and no less, rather
than to use someone else's code. I say this as a promiscuous user of
CPAN modules -- hardly a week goes by that I don't install a new
module for one reason or another -- and frequently I just look at the
source, modify it to do what I need, and don't use or require the
module.

TIMTOWTDI, CC.
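The normalization described above (trim each line, then keep joining lines until a closing tag turns up) might look roughly like the sketch below. The helper name `read_element` and the sample input are made up for illustration, and the sample is only one reading of the spread-out `<order>` element quoted above. It is not a general XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: append trimmed lines to a buffer until the closing
# tag for $tag shows up, then return the joined element.
sub read_element {
    my ($fh, $tag) = @_;
    my $buf = '';
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace, newline included
        $buf .= $line;
        return $buf if $buf =~ m{</\Q$tag\E\s*>};
    }
    return;    # closing tag never appeared
}

# Sample input (assumed): one element spread over three physical lines.
my $input = "<order\n  >1</order\n>\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
my $element = read_element($fh, 'order');
close $fh;

print "$element\n";
```

Joining lines with nothing in between is only safe when the breaks fall where whitespace is optional, as they do here. A start tag with attributes split across lines would need a space kept at each join.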
On Nov 11, 7:11 pm, Dwight Army of Champions
<dwightarmyofchampi...@hotmail.com>
> <?xml version="1.0"?>
> <library>
> <book>
>         <title>Dreamcatcher</title>
>         <author>Stephen King</author>
>         <genre>Horror</genre>
>         <pages>899</pages>
>         <price>23.99</price>
>         <rating>5</rating>
>         <publication_date>11/27/2001</publication_date>
> </book>
....
> </library>

If I had this kind of file, and it was a static file, I would read it
into some kind of database. If you used something like SQLite, you
could read it into a table <book> element by <book> element, and then
use normal SQL to munge your data.

Alternatively, you could convert the file into CSV format, which in
many ways is a lot easier to handle than XML.

It strikes me that using XML for this kind of work is overkill, unless
you had a specific requirement to use XML. If you had to use XML it
might pay to learn a little XSLT and use that instead of Perl. Perl is
a great language for string processing, but in some cases XSLT works
better.

CC.
On 16 nov, 18:17, ccc31807 <carte...@gmail.com> wrote:
> On Nov 11, 7:11 pm, Dwight Army of Champions
> <dwightarmyofchampi...@hotmail.com>
>
> > <?xml version="1.0"?>
> > <library>
> > <book>
> >         <title>Dreamcatcher</title>
> >         <author>Stephen King</author>
> >         <genre>Horror</genre>
> >         <pages>899</pages>
> >         <price>23.99</price>
> >         <rating>5</rating>
> >         <publication_date>11/27/2001</publication_date>
> > </book>
> ...
> > </library>
>
> If I had this kind of file, and it was a static file, I would read it
> into some kind of database. If you used something like SQLite, you
> could read it into a table <book> element by <book> element, and then
> use normal SQL to munge your data.
>
> Alternatively, you could convert the file into CSV format, which in many
> ways is a lot easier to handle than XML.

Converting to CSV is as easy as:

use strict;
use warnings;
use XML::Reader;
use Text::CSV_XS;

my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
  { root => '/library/book', branch => [
    '/title',
    '/author',
    '/genre',
    '/pages',
    '/price',
    '/rating',
    '/publication_date',
  ]},
  { root => '/library/music', branch => [
    '/title',
    '/artist',
    '/release_date',
    '/label',
  ]});

my $csv = Text::CSV_XS->new({ sep_char => ',', binary => 1, eol => $/ });

open my $ofh, '>', 'out.csv' or die $!;

while ($rdr->iterate) {
    $csv->print($ofh, [ ($rdr->rx == 0 ? 'book' : 'music'), $rdr->value ]);
}

close $ofh;
On 16 nov, 17:32, ccc31807 <carte...@gmail.com> wrote:
> On Nov 11, 5:39 pm, Dwight Army of Champions
> <dwightarmyofchampi...@hotmail.com> wrote:
> > I have a very large XML file that I want to load, but I don't want to
> > necessarily load the entire document; that takes too long. What I want
> > to do instead is only key/value pairs that meet certain criteria, like
> > only grab entries whose value fall within a certain date for a key
> > date_of_entry. Can I just use XML::Simple for this or do I need a
> > better module?
>
> This depends on the nature of your input. I do this kind of thing
> every day, and use a simple regular expression to filter the file. Of
> course, you still have to read every line of the file to make sure
> that you catch all of your intended targets, but you would have to do
> that anyway. This is the kind of task for which it's a lot easier to
> hand roll your own parser than it is to look for, evaluate, learn,
> install, and use some third party module. In my opinion anyway. For
> example:
>
> SCRIPT
> #! perl
> use warnings;
> use strict;
> my %filter;
> while (<DATA>)
> {
>     next unless /\w/;
>     chomp;
>     if ($_ =~ m!<order>(\d+)</order>!)
>     {
>         my $key = $1;
>         while (<DATA>)
>         {
>             last if $_ =~ m!</pres>!;
>             next unless $_ =~ m!<last>(\w+)</last>!;
>             $filter{$key} = $1;
>         }
>     }}
>
> print "Finished processing file\n";
> foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
> exit(0);

Using XML::Reader, it's even easier:

use strict;
use warnings;
use XML::Reader;

my %filter;

my $rdr = XML::Reader->new(\*DATA, {mode => 'branches'},
  { root => '/data/pres', branch => [
    '/order',
    '/last',
  ]});

while ($rdr->iterate) {
    my ($order, $last) = $rdr->value;
    $filter{$order} = $last;
}

print "Finished processing file\n";
foreach my $key (sort keys %filter) {
    print "$key => $filter{$key}\n";
}

__DATA__
<data>
  <pres>
    <order>1</order>
    <first>George</first>
    <last>Washington</last>
    <year>1788</year>
  </pres>
  <pres>
    <order>2</order>
    <first>John</first>
    <last>Adams</last>
    <year>1796</year>
  </pres>
  <pres>
    <order>3</order>
    <first>Thomas</first>
    <last>Jefferson</last>
    <year>1800</year>
  </pres>
</data>
ccc31807 <cartercc@gmail.com> writes:
> On Nov 16, 11:50 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>> AFAIK, a well-formed XML file could have an order description looking like
>> this:
>>
>> <order
>>
>> >1</order
>>
>> meaning, it is not really possible to parse XML without doing a
>> character-by-character lexical analysis of the input data stream
>> first.
>
> As I said, it depends on the nature of your input. XML handles
> 'ragged' data as well as the kind of normalized data we would expect
> to use for an RDBMS. If you aren't sure of the format of your data,
> you obviously have to validate it somehow. Part of this might be
> removing whitespace at the beginning and ends of lines. Sometimes it
> might be removing newlines from several lines until you match some
> kind of closing tag.

The point I was trying to make is that the kind of input your
(example) code can deal with needs to follow the rules of a grammar
which is a proper subset of the XML grammar.
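A small sketch of the "proper subset" point, using only core Perl. The `$strict` pattern is the one from the line-oriented script earlier in the thread. The `$loose` pattern, the helper `order_of`, and the two sample strings are illustrative assumptions, with the spread-out string being one reading of the example quoted above.

```perl
use strict;
use warnings;

# Two spellings of the same element. XML allows optional whitespace
# before the closing '>' of both start tags and end tags.
my $compact = "<order>1</order>";
my $spread  = "<order\n>1</order\n>";

# The pattern used by the line-oriented script earlier in the thread.
my $strict = qr{<order>(\d+)</order>};

# Illustrative variant: tolerate whitespace before each closing '>'.
my $loose = qr{<order\s*>(\d+)</order\s*>};

# Return the captured order number, or undef when the pattern does not match.
sub order_of {
    my ($re, $xml) = @_;
    return $xml =~ $re ? $1 : undef;
}

for my $xml ($compact, $spread) {
    my $s = order_of($strict, $xml) // 'no match';
    my $l = order_of($loose,  $xml) // 'no match';
    print "strict: $s, loose: $l\n";
}
```

The strict pattern accepts only the compact spelling. The loose one accepts both, but only when the whole element sits in a single string, and it still knows nothing about attributes, comments, CDATA sections or entity references, so it too covers just a subset of XML.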
On Nov 17, 2:33 pm, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
> The point I was trying to make is that the kind of input your (example) code
> can deal with needs to follow the rules of a grammar which is a proper
> subset of the XML grammar.

Yes, I understood your point. We all have to deal with messy data, and
faulty input will kill an application with no hope of recovery if you
don't deal with the possibility of corrupted data.

That said, if you are confident of the format of your input (as you
might be with an input file generated from a database) it might be
quicker and easier to hand roll your own. If you have XML, you can use
a SAX parser to process your input element by element, and I assume
that it would handle your whitespace example without a problem.

I don't deal with XML much, and I really appreciate the posts from
others that illustrate scripts with XML::Reader and the like. I didn't
have it but installed it yesterday, and have spent several hours
piddling with it.

CC.