  #1 (permalink)  
Old 02-07-2010, 05:59 PM
Roedy Green
Guest
 
Posts: n/a
Default large XML files

It seems to me the usual XML tools in Java load the entire XML file
into RAM. Are there any tools that process sequentially, bringing in
only a chunk at a time, so you could handle really fat files?
--
Roedy Green Canadian Mind Products
http://mindprod.com

Every compilable program in a sense works. The problem is with your unrealistic expectations of what it will do.
  #2 (permalink)  
Old 02-07-2010, 06:13 PM
Donkey Hottie
Guest
 
Posts: n/a
Default Re: large XML files

On 7.2.2010 19:59, Roedy Green wrote:
> It seems to me the usual XML tools in Java load the entire XML file
> into RAM. Are there any tools that process sequentially, bringing in
> only a chunk at a time so you could handle really fat files.


Java has tools for such XML files. SAX processes XML as a stream of
events, so it does not need to load the whole file into memory.
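
A minimal sketch of what that looks like, assuming the standard JAXP SAX parser that ships with the JDK. The class name SaxCountSketch, the method countElements and the sample document are made up for illustration. The handler only sees one event at a time, so nothing but the running count is kept.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not from the thread.
class SaxCountSketch {
    /** Counts elements with the given name without building a document tree. */
    static int countElements(InputSource in, final String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Called once per start tag as the parser reads forward.
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order id='1'/><order id='2'/></orders>";
        System.out.println(countElements(new InputSource(new StringReader(xml)), "order"));
    }
}
```

For a really large file you would hand the parser an InputSource wrapping a FileInputStream instead of a string.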

--
Good day for a change of scene. Repaper the bedroom wall.
  #3 (permalink)  
Old 02-07-2010, 06:14 PM
John B. Matthews
Guest
 
Posts: n/a
Default Re: large XML files

In article <3qvtm5h7bf92h7nos1nms4oc4m6cd203d6@4ax.com>,
Roedy Green <see_website@mindprod.com.invalid> wrote:

> It seems to me the usual XML tools in Java load the entire XML file
> into RAM. Are there any tools that process sequentially, bringing in
> only a chunk at a time so you could handle really fat files.



I thought that was a principal advantage of the Simple API For XML (SAX)
model, at least in principle. :-)

<http://www.totheriver.com/learn/xml/xmltutorial.html>

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
  #4 (permalink)  
Old 02-07-2010, 06:14 PM
Peter Duniho
Guest
 
Posts: n/a
Default Re: large XML files

Roedy Green wrote:
> It seems to me the usual XML tools in Java load the entire XML file
> into RAM. Are there any tools that process sequentially, bringing in
> only a chunk at a time so you could handle really fat files.


Sounds like you want the XMLStreamReader interface:
http://java.sun.com/javase/6/docs/ap...eamReader.html

I haven't used the Java version myself (there's a similar type in .NET),
and haven't looked closely enough to determine the specifics. But I presume
there's a way to get an implementation of the interface (looks like
XMLInputFactory is the way to go).

Of course, if per a previous discussion you're stuck on Java 1.5, this
is unavailable to you. But otherwise, you should find it exactly what
you're asking for.
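
For reference, a rough sketch of the pull-style loop, assuming Java 6 or later and the XMLInputFactory route mentioned above. The class name StaxPullSketch and the method countElements are invented for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not from the thread.
class StaxPullSketch {
    /** Pulls events one at a time and counts start tags with the given name. */
    static int countElements(Reader in, String name) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances to the following event; the caller drives the parse.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(name)) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order id='1'/><order id='2'/></orders>";
        System.out.println(countElements(new StringReader(xml), "order"));
    }
}
```

The difference from SAX is mainly who is in control: here the application asks for the next event instead of being called back.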

Pete
  #5 (permalink)  
Old 02-07-2010, 06:20 PM
Donkey Hottie
Guest
 
Posts: n/a
Default Re: large XML files

On 7.2.2010 20:14, Peter Duniho wrote:
> Roedy Green wrote:
>> It seems to me the usual XML tools in Java load the entire XML file
>> into RAM. Are there any tools that process sequentially, bringing in
>> only a chunk at a time so you could handle really fat files.

>
> Sounds like you want the XMLStreamReader interface:
> http://java.sun.com/javase/6/docs/ap...eamReader.html
>
> I haven't used the Java version myself (there's a similar type in .NET),
> and haven't looked closely enough to determine the specifics. But I presume
> there's a way to get an implementation of the interface (looks like
> XMLInputFactory is the way to go).
>
> Of course, if per a previous discussion you're stuck on Java 1.5, this
> is unavailable to you. But otherwise, you should find it exactly what
> you're asking for.
>
> Pete


The SAX interface works fine even with Java 1.4, and it does what Roedy wants.


--
Good day for a change of scene. Repaper the bedroom wall.
  #6 (permalink)  
Old 02-07-2010, 08:04 PM
Arne Vajhøj
Guest
 
Posts: n/a
Default Re: large XML files

On 07-02-2010 12:59, Roedy Green wrote:
> It seems to me the usual XML tools in Java load the entire XML file
> into RAM.


????

W3C DOM and JAXB do load all data into memory.

SAX and StAX do not load all data into memory.

Arne
  #7 (permalink)  
Old 02-07-2010, 08:31 PM
Lew
Guest
 
Posts: n/a
Default Re: large XML files

On 2/7/2010 1:20 PM, Donkey Hottie wrote:
> On 7.2.2010 20:14, Peter Duniho wrote:
>> Roedy Green wrote:
>>> It seems to me the usual XML tools in Java load the entire XML file
>>> into RAM. Are there any tools that process sequentially, bringing in
>>> only a chunk at a time so you could handle really fat files.

>>
>> Sounds like you want the XMLStreamReader interface:
>> http://java.sun.com/javase/6/docs/ap...eamReader.html
>>
>> I haven't used the Java version myself (there's a similar type in .NET),
>> and haven't looked closely enough to determine the specifics. But I presume
>> there's a way to get an implementation of the interface (looks like
>> XMLInputFactory is the way to go).
>>
>> Of course, if per a previous discussion you're stuck on Java 1.5, this
>> is unavailable to you. But otherwise, you should find it exactly what
>> you're asking for.
>>
>> Pete

>
> SAX interface works fine even with Java 1.4, and it does what Roedy wants.


It's been around since Java 1.2; it better work with 1.4.

--
Lew

  #8 (permalink)  
Old 02-07-2010, 08:32 PM
Lew
Guest
 
Posts: n/a
Default Re: large XML files

Roedy Green wrote:
>> It seems to me the usual XML tools in Java load the entire XML file
>> into RAM. Are there any tools that process sequentially, bringing in
>> only a chunk at a time so you could handle really fat files.


Donkey Hottie wrote:
> Java has tools for such XML files. SAX processes XML so that it does not
> need to load it all to memory.


I first used SAX for XML parsing in early 1999. There's nothing new
about it.

SAX, and its equally handy StAX sibling, are perfect for single-pass,
very-high-speed, memory-parsimonious handling of XML documents.

Roedy has an interesting definition of "usual XML tools", since he's
ignoring two out of three interfaces, including one that's been around
nearly forever.

--
Lew
  #9 (permalink)  
Old 02-07-2010, 09:35 PM
Arne Vajhøj
Guest
 
Posts: n/a
Default Re: large XML files

On 07-02-2010 15:31, Lew wrote:
> On 2/7/2010 1:20 PM, Donkey Hottie wrote:
>> On 7.2.2010 20:14, Peter Duniho wrote:
>>> Roedy Green wrote:
>>>> It seems to me the usual XML tools in Java load the entire XML file
>>>> into RAM. Are there any tools that process sequentially, bringing in
>>>> only a chunk at a time so you could handle really fat files.
>>>
>>> Sounds like you want the XMLStreamReader interface:
>>> http://java.sun.com/javase/6/docs/ap...eamReader.html
>>>
>>>
>>> I haven't used the Java version myself (there's a similar type in .NET),
>>> and haven't looked closely enough to determine the specifics. But I presume
>>> there's a way to get an implementation of the interface (looks like
>>> XMLInputFactory is the way to go).
>>>
>>> Of course, if per a previous discussion you're stuck on Java 1.5, this
>>> is unavailable to you. But otherwise, you should find it exactly what
>>> you're asking for.
>>>
>>> Pete

>>
>> SAX interface works fine even with Java 1.4, and it does what Roedy
>> wants.

>
> It's been around since Java 1.2; it better work with 1.4.


Yes and no.

SAX was added to the Java API in 1.4.

The JAXP API, including SAX, existed before Java 1.4, and
libraries implementing it could be downloaded separately.

I have done the latter for Java 1.3 and it may have
existed already for 1.2.

Arne



  #10 (permalink)  
Old 02-07-2010, 09:37 PM
Mike Schilling
Guest
 
Posts: n/a
Default Re: large XML files

Arne Vajhøj wrote:
> On 07-02-2010 12:59, Roedy Green wrote:
>> It seems to me the usual XML tools in Java load the entire XML file
>> into RAM.

>
> ????
>
> W3CDOM and JAXB do load all data in memory.
>
> SAX and StAX do not load all data in memory.


If you use XSLT to process an XML file, it has to keep a complete
representation of the resulting XML document in memory, since an XSLT
transformation can include XPath expressions, and XPath can in principle
access anything in the document. This is true even if the input to XSLT is
a SAXSource.
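
To make that concrete, here is a small sketch using the JDK's javax.xml.transform API with a SAXSource as input. The stylesheet, the class name XsltSaxSourceSketch and the sample document are invented for the example. The stylesheet prints each item followed by count(//item), and that total cannot be known until the last item has been read, so the processor cannot emit even the first line of output while holding only the current element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

// Hypothetical example class; not from the thread.
class XsltSaxSourceSketch {
    // Made-up stylesheet: every output record needs the document-wide item count.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//item'>"
      + "<xsl:value-of select='.'/>/<xsl:value-of select='count(//item)'/>;"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        // The input arrives as SAX events, but the stylesheet still needs the whole document.
        SAXSource source = new SAXSource(new InputSource(new StringReader(xml)));
        StringWriter out = new StringWriter();
        t.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<items><item>a</item><item>b</item></items>"));
    }
}
```

The API accepts a streaming source happily; how much the processor keeps internally is up to the implementation.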


  #11 (permalink)  
Old 02-07-2010, 10:00 PM
Arne Vajhøj
Guest
 
Posts: n/a
Default Re: large XML files

On 07-02-2010 16:37, Mike Schilling wrote:
> Arne Vajhøj wrote:
>> On 07-02-2010 12:59, Roedy Green wrote:
>>> It seems to me the usual XML tools in Java load the entire XML file
>>> into RAM.

>>
>> ????
>>
>> W3CDOM and JAXB do load all data in memory.
>>
>> SAX and StAX do not load all data in memory.

>
> If you use XSLT to process an XML file, it has to keep a complete
> representation of the resulting XML document in memory, since an XSLT
> transformation can include XPath expressions, and XPath can in principle
> access anything in the document. This is true even if the input to XSLT is
> a SAXSource.


True.

But that problem is very hard to solve.

Arne
  #12 (permalink)  
Old 02-07-2010, 10:25 PM
Tom Anderson
Guest
 
Posts: n/a
Default Re: large XML files

On Sun, 7 Feb 2010, Mike Schilling wrote:

> Arne Vajhøj wrote:
>> On 07-02-2010 12:59, Roedy Green wrote:
>>> It seems to me the usual XML tools in Java load the entire XML file
>>> into RAM.

>>
>> ????
>>
>> W3CDOM and JAXB do load all data in memory.
>>
>> SAX and StAX do not load all data in memory.

>
> If you use XSLT to process an XML file, it has to keep a complete
> representation of the resulting XML document in memory, since an XSLT
> transformation can include XPath expressions, and XPath can in principle
> access anything in the document. This is true even if the input to
> XSLT is a SAXSource.


Weeeellll, kinda. Some XSLTs will require the whole document to be held in
memory. But it is possible to process some XSLTs in a streaming or
streaming-ish manner (where elements are held in memory, but only a subset
at a time). There's nothing stopping an XSLT processor compiling such
XSLTs into a form which does just that. Whether any actually do, i don't
know.

A while ago, i read about a streaming XPath processor. It couldn't handle
all XPaths in a streaming manner, so it had to fall back to searching an
in-memory tree where that was the case, but many common XPaths can be
handled streamingly. For instance, something like:

//order[@id='99']/order-item

Could be. You run the parse, and maintain the current stack of elements in
memory - all the elements enclosing the current parse point, IYSWIM. Then
you just look at the top of the stack at every point to see if it's an
order-item, then if it is, look back to see if the enclosing order has an
id of 99. You could probably do it more efficiently than that, but that's
one way you could do it. Something like this:

//order[customer[@id='99']]/order-item

Is more challenging, and requires a more sophisticated evaluation strategy
- you might need to read in a whole order, search it for matching
order-items, then throw it away and move on to the next one. Or, if you
knew from the DTD that the customer element had to come before any
order-items in an order, you could build a state machine that could decide
that it was inside a matching order, and then report all order-items.
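
The stack idea for the first expression, //order[@id='99']/order-item, could be sketched roughly like this on top of SAX. This is only an illustration of the approach described above, not the processor mentioned earlier. The class name, the method name and the sku attribute used to identify each match are all made up.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not from the thread.
class StreamingXPathSketch {
    /** Streams //order[@id='99']/order-item and returns each match's "sku" attribute. */
    static List<String> matchOrderItems(String xml) throws Exception {
        final List<String> hits = new ArrayList<String>();
        // One entry per open element: {element name, value of its id attribute or null}.
        final Deque<String[]> open = new ArrayDeque<String[]>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                String[] parent = open.peek(); // innermost enclosing element, if any
                if (qName.equals("order-item") && parent != null
                        && parent[0].equals("order") && "99".equals(parent[1])) {
                    hits.add(atts.getValue("sku"));
                }
                open.push(new String[] {qName, atts.getValue("id")});
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                open.pop(); // leaving the element, so it no longer encloses anything
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return hits;
    }
}
```

Memory use is bounded by the nesting depth of the document rather than its size, since only the chain of open elements is kept.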

Anyway, all speculation, but it's interesting stuff!
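
For the second expression, //order[customer[@id='99']]/order-item, the "read in a whole order, then throw it away" strategy might look something like the following StAX sketch. It assumes orders are not nested inside each other and, as before, uses a made-up sku attribute to identify each order-item. Class and method names are invented too.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not from the thread.
class BufferedOrderSketch {
    static List<String> matchOrderItems(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> hits = new ArrayList<String>();
        List<String> pending = new ArrayList<String>(); // order-items of the current order
        boolean matched = false; // has the current order a customer with id 99?
        int depth = -1;          // -1 outside any order, 0 at the order, 1 at its children
        try {
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if (depth < 0) {
                        if (name.equals("order")) {
                            depth = 0;
                        }
                    } else if (++depth == 1) {
                        if (name.equals("customer")
                                && "99".equals(r.getAttributeValue(null, "id"))) {
                            matched = true;
                        } else if (name.equals("order-item")) {
                            pending.add(r.getAttributeValue(null, "sku"));
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT && depth >= 0) {
                    if (depth == 0) {
                        // End of the order: keep its items only if the predicate held.
                        if (matched) {
                            hits.addAll(pending);
                        }
                        pending.clear();
                        matched = false;
                    }
                    depth--;
                }
            }
        } finally {
            r.close();
        }
        return hits;
    }
}
```

The buffer is needed because an order-item may appear before the customer element that decides whether it matches. If the DTD guaranteed the customer came first, the pending list could be dropped and matches reported immediately.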

tom

--
Dreams are not covered by any laws. They can be about anything. --
Cmdr Zorg
  #13 (permalink)  
Old 02-07-2010, 10:26 PM
Tom Anderson
Guest
 
Posts: n/a
Default Re: large XML files

On Sun, 7 Feb 2010, Roedy Green wrote:

> It seems to me the usual XML tools in Java load the entire XML file into
> RAM. Are there any tools that process sequentially, bringing in only a
> chunk at a time so you could handle really fat files.


What do you mean by 'tools'?

tom

--
Dreams are not covered by any laws. They can be about anything. --
Cmdr Zorg
  #14 (permalink)  
Old 02-08-2010, 03:12 AM
Mike Schilling
Guest
 
Posts: n/a
Default Re: large XML files

Tom Anderson wrote:
> On Sun, 7 Feb 2010, Mike Schilling wrote:
>
>> Arne Vajhøj wrote:
>>> On 07-02-2010 12:59, Roedy Green wrote:
>>>> It seems to me the usual XML tools in Java load the entire XML file
>>>> into RAM.
>>>
>>> ????
>>>
>>> W3CDOM and JAXB do load all data in memory.
>>>
>>> SAX and StAX do not load all data in memory.

>>
>> If you use XSLT to process an XML file, it has to keep a complete
>> representation of the resulting XML document in memory, since an
>> XSLT transformation can include XPath expressions, and XPath can in
>> principle access anything in the document. This is true even if
>> the input to XSLT is a SAXSource.

>
> Weeeellll, kinda. Some XSLTs will require the whole document to be
> held in memory. But it is possible to process some XSLTs in a
> streaming or streaming-ish manner (where elements are held in memory,
> but only a subset at a time). There's nothing stopping an XSLT
> processor compiling such XSLTs into a form which does just that.
> Whether any actually do, i don't know.


Xalan (the XSLT processor in the JDK) doesn't.


  #15 (permalink)  
Old 02-08-2010, 03:13 AM
Arne Vajhøj
Guest
 
Posts: n/a
Default Re: large XML files

On 07-02-2010 17:25, Tom Anderson wrote:
> On Sun, 7 Feb 2010, Mike Schilling wrote:
>> Arne Vajhøj wrote:
>>> On 07-02-2010 12:59, Roedy Green wrote:
>>>> It seems to me the usual XML tools in Java load the entire XML file
>>>> into RAM.
>>>
>>> ????
>>>
>>> W3CDOM and JAXB do load all data in memory.
>>>
>>> SAX and StAX do not load all data in memory.

>>
>> If you use XSLT to process an XML file, it has to keep a complete
>> representation of the resulting XML document in memory, since an
>> XSLT transformation can include XPath expressions, and XPath can in
>> principle access anything in the document. This is true even if the
>> input to XSLT is a SAXSource.

>
> Weeeellll, kinda. Some XSLTs will require the whole document to be held
> in memory. But it is possible to process some XSLTs in a streaming or
> streaming-ish manner (where elements are held in memory, but only a
> subset at a time). There's nothing stopping an XSLT processor compiling
> such XSLTs into a form which does just that. Whether any actually do, i
> don't know.
>
> A while ago, i read about a streaming XPath processor. It couldn't
> handle all XPaths in a streaming manner, so it had to fall back to
> searching an in-memory tree where that was the case, but many common
> XPaths can be handled streamingly. For instance, something like:
>
> //order[@id='99']/order-item
>
> Could be. You run the parse, and maintain the current stack of elements
> in memory - all the elements enclosing the current parse point, IYSWIM.
> Then you just look at the top of the stack at every point to see if it's
> an order-item, then if it is, look back to see if the enclosing order
> has an id of 99. You could probably do it more efficiently than that,
> but that's one way you could do it. Something like this:
>
> //order[customer[@id='99']]/order-item
>
> Is more challenging, and requires a more sophisticated evaluation
> strategy - you might need to read in a whole order, search it for
> matching order-items, then throw it away and move on to the next one.
> Or, if you knew from the DTD that the customer element had to come
> before any order-items in an order, you could build a state machine that
> could decide that it was inside a matching order, and then report all
> order-items.
>
> Anyway, all speculation, but it's interesting stuff!


Interesting.

But for code written today against the standard XML libraries,
it is safe to assume that XSLT will read the whole document
into memory.

Arne

 