#1
02-10-2010, 10:28 AM
Michael Powe
Best Way to Process Large Text Files

Hello,

I am tasked with writing an application to process some large text
files, i.e. > 1 GB. The input will be csv and the output will be in the
format of an IIS web server log.

I've done this sort of thing before. In the past, I've just
brute-forced it, with a BufferedReader and BufferedWriter handling the
input/output line by line.

I have a little time to complete this project and I'd like to build
something more efficient, that won't peg the CPU for an hour.

My thought was to have a read thread and a write thread and create a
buffer into which some amount of input would be written; and then, when
a threshold was reached, the data would be written out.

Is this a good idea? Are there better ways to manage this?

And finally, I need pointers as to how I would create such a buffer.
The threaded read/write part I can do.

Thanks for any help.

mp

--
Michael Powe michael@trollope.org Naugatuck CT USA
Re graphics: A picture is worth 10K words -- but only those to describe
the picture. Hardly any sets of 10K words can be adequately described
with pictures.

#2
02-10-2010, 11:42 AM
rossum
Re: Best Way to Process Large Text Files

On Wed, 10 Feb 2010 06:28:14 -0500, Michael Powe
<michael+gnus@trollope.org> wrote:

>Hello,
>
>I am tasked with writing an application to process some large text
>files, i.e. > 1 GB. The input will be csv and the output will be in the
>format of an IIS web server log.
>
>I've done this sort of thing before. In the past, I've just
>brute-forced it, with a BufferedReader and BufferedWriter handling the
>input/output line by line.
>
>I have a little time to complete this project and I'd like to build
>something more efficient, that won't peg the CPU for an hour.
>
>My thought was to have a read thread and a write thread and create a
>buffer into which some amount of input would be written; and then, when
>a threshold was reached, the data would be written out.
>
>Is this a good idea? Are there better ways to manage this?
>
>And finally, I need pointers as to how I would create such a buffer.
>The threaded read/write part I can do.
>
>Thanks for any help.
>
>mp

If the input is a CSV file then the logical unit is presumably a
record, either as a line of text or (partly) processed.

Create a queue. The read process adds records to the queue. The
write process pulls records off the queue.
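
A minimal sketch of that queue arrangement, assuming one reader thread, one writer thread, and a bounded java.util.concurrent.ArrayBlockingQueue of lines. The class name QueuePipeline, the method pump, and the EOF marker are made up for illustration; the transformation step is left as a comment.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueuePipeline {
    // Unique marker object meaning "no more records". It is compared by
    // identity (!=), so no line of real data can be mistaken for it.
    private static final String EOF = new String("EOF");

    /** Moves records from in to out through a bounded queue; returns the record count. */
    static long pump(Reader in, Writer out, int capacity)
            throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        Exception[] failure = new Exception[1];

        Thread reader = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(in)) {
                String line;
                while ((line = br.readLine()) != null) {
                    queue.put(line);          // blocks while the queue is full
                }
                queue.put(EOF);
            } catch (IOException | InterruptedException e) {
                failure[0] = e;
                queue.clear();                // make room so the marker always fits
                queue.offer(EOF);
            }
        });
        reader.start();

        long count = 0;
        try (BufferedWriter bw = new BufferedWriter(out)) {
            String record;
            while ((record = queue.take()) != EOF) {   // blocks while the queue is empty
                bw.write(record);             // a real program would transform the record here
                bw.write('\n');
                count++;
            }
        }
        reader.join();
        if (failure[0] != null) {
            throw new IOException("reader thread failed", failure[0]);
        }
        return count;
    }
}
```

The bounded capacity is what keeps memory use flat: if the writer falls behind, put() stalls the reader instead of letting the queue grow to the size of the file.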

rossum

#3
02-10-2010, 05:48 PM
Tom Anderson
Re: Best Way to Process Large Text Files

On Wed, 10 Feb 2010, Michael Powe wrote:

> My thought was to have a read thread and a write thread and create a
> buffer into which some amount of input would be written; and then, when
> a threshold was reached, the data would be written out.
>
> Is this a good idea?


I'm slightly skeptical. If the processing is simple, then most of the time
will be spent doing IO even with a simple implementation. Adding threads
to overlap IO and processing might not be a big win. You could try writing
a sequential version of the program (with sufficiently large buffers - a
few megabytes, maybe?), then measuring how fast it runs - if the total
input and output data rate is close to your storage subsystem's capacity,
then no amount of programming cleverness will make it much faster.
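
The sequential baseline described above could look roughly like this. It is a sketch under assumptions: 4 MB buffers stand in for "a few megabytes", UTF-8 is a guess at the file encoding, and the class name SequentialConvert is invented. The line is copied unchanged where the real CSV-to-log conversion would go.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class SequentialConvert {
    static final int BUFFER_CHARS = 4 * 1024 * 1024;   // assumed buffer size

    /** Single-threaded, line-at-a-time copy; returns the number of lines written. */
    static long convert(Reader in, Writer out) throws IOException {
        long lines = 0;
        try (BufferedReader br = new BufferedReader(in, BUFFER_CHARS);
             BufferedWriter bw = new BufferedWriter(out, BUFFER_CHARS)) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);        // the real conversion would happen here
                bw.write('\n');
                lines++;
            }
        }
        return lines;
    }

    // Usage: java SequentialConvert input.csv output.log
    public static void main(String[] args) throws IOException {
        File src = new File(args[0]);
        File dst = new File(args[1]);
        long start = System.nanoTime();
        long lines;
        try (Reader in = new InputStreamReader(new FileInputStream(src), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(new FileOutputStream(dst), StandardCharsets.UTF_8)) {
            lines = convert(in, out);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        // Total bytes read plus bytes written, to compare against the disk's rated throughput.
        double megabytes = (src.length() + dst.length()) / 1e6;
        System.out.printf("%d lines, %.1f MB in %.2f s = %.1f MB/s%n",
                lines, megabytes, seconds, megabytes / seconds);
    }
}
```

If the printed MB/s figure is already near what the storage can deliver, threads are unlikely to help much.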

If, OTOH, there's significant headroom above the rate you reach, then
using threads as you describe would be a good thing to try. Either that or
non-blocking IO via the NIO package, but i think you'd get decent results
from threads.

> And finally, I need pointers as to how I would create such a buffer. The
> threaded read/write part I can do.


You could try java.io.PipedInputStream and PipedOutputStream. If you want
a bigger buffer, you could grab the code for these from OpenJDK and modify
it. Mind you, circular buffers are a pretty standard bit of programming,
so there will be dozens of other implementations and descriptions out
there on the web.
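
For what it's worth, since Java 6 PipedInputStream has a constructor that takes the pipe size, so a bigger buffer may not require modifying the OpenJDK source. A small sketch, with the class name PipedDemo and the method roundTrip invented for illustration and UTF-8 assumed as the encoding:

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class PipedDemo {
    /** Sends text through a pipe of the given size from one thread to another. */
    static String roundTrip(String text, int pipeSize)
            throws IOException, InterruptedException {
        PipedOutputStream sink = new PipedOutputStream();
        // Connects the two ends and sets the size of the circular buffer between them.
        PipedInputStream source = new PipedInputStream(sink, pipeSize);

        Thread producer = new Thread(() -> {
            // Closing the writer closes the pipe, which the consumer sees as end of stream.
            try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                w.write(text);         // blocks whenever the pipe is full
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        producer.start();

        StringBuilder received = new StringBuilder();
        try (Reader r = new InputStreamReader(source, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf)) != -1) {   // blocks whenever the pipe is empty
                received.append(buf, 0, n);
            }
        }
        producer.join();
        return received.toString();
    }
}
```

The two ends must be used from different threads; reading and writing a pipe from the same thread can deadlock once the buffer fills.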

tom

--
It's rare that you're simply presented with a knob whose only two
positions are "Make History" and "Flee Your Glorious Destiny." --
Tycho Brahae

#4
02-10-2010, 06:13 PM
Roedy Green
Re: Best Way to Process Large Text Files

On Wed, 10 Feb 2010 06:28:14 -0500, Michael Powe
<michael+gnus@trollope.org> wrote, quoted or indirectly quoted someone
who said :

>I've done this sort of thing before. In the past, I've just
>brute-forced it, with a BufferedReader and BufferedWriter handling the
>input/output line by line.


There is quite a bit of CPU work parsing a CSV file. Try
http://mindprod.com/products1.html#CSV
and give it a 64K buffer before you go to a lot of work cooking up
something exotic.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Every compilable program in a sense works. The problem is with your unrealistic expectations on what it will do.

#5
02-10-2010, 10:59 PM
EJP
Re: Best Way to Process Large Text Files

On 10/02/2010 10:28 PM, Michael Powe wrote:
> I have a little time to complete this project and I'd like to build
> something more efficient, that won't peg the CPU for an hour.


Fix your code. It only takes a few seconds to read a file of practically
any size. In my experience the only way you can take an hour to process
any file on modern equipment is if you read the whole file into memory
via concatenation of Strings and then process it, which is the wrong
approach from every possible point of view. Process a line at a time.

#6
02-11-2010, 01:07 AM
Arne Vajhøj
Re: Best Way to Process Large Text Files

On 10-02-2010 18:59, EJP wrote:
> On 10/02/2010 10:28 PM, Michael Powe wrote:
>> I have a little time to complete this project and I'd like to build
>> something more efficient, that won't peg the CPU for an hour.

>
> Fix your code. It only takes a few seconds to read a file of practically
> any size. In my experience the only way you can take an hour to process
> any file on modern equipment is if you read the whole file into memory
> via concatenation of Strings and then process it, which is the wrong
> approach from every possible point of view. Process a line at a time.


I agree completely with your point.

Huge files may still take time to read from the disk though.

Arne

#7
02-11-2010, 02:45 AM
markspace
Re: Best Way to Process Large Text Files

Arne Vajhøj wrote:
> On 10-02-2010 18:59, EJP wrote:
>> On 10/02/2010 10:28 PM, Michael Powe wrote:
>>> I have a little time to complete this project and I'd like to build
>>> something more efficient, that won't peg the CPU for an hour.

>>
>> Fix your code. It only takes a few seconds to read a file of practically
>> any size. In my experience the only way you can take an hour to process
>> any file on modern equipment is if you read the whole file into memory
>> via concatenation of Strings and then process it, which is the wrong
>> approach from every possible point of view. Process a line at a time.

>
> I agree completely with your point.
>
> Huge files may still take time to read from the disk though.



The OP said > 1 GB, so we don't know if he meant up to 2 GB or if he's
talking about 10 GB or 100 GB or 1000 GB. So a little clarification
here would help, I think.

#8
02-11-2010, 09:26 PM
Roedy Green
Re: Best Way to Process Large Text Files

On Wed, 10 Feb 2010 06:28:14 -0500, Michael Powe
<michael+gnus@trollope.org> wrote, quoted or indirectly quoted someone
who said :

>
>I am tasked with writing an application to process some large text
>files, i.e. > 1 GB. The input will be csv and the output will be in the
>format of an IIS web server log.


Do a little benchmark where you do nothing but read the giant file.

If all the time is spent processing the file, there is not much point in
fancy stuff to read the file.

Usually when things slow to a crawl it is because you have filled RAM
with objects you don't need, and that forces very frequent GC.

Before you start optimising, you first have to prove where the
bottlenecks are.
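
A read-only benchmark of the kind suggested here might be as small as the following sketch. The class name ReadOnlyBenchmark and the 64K chunk size are assumptions; the bytes are read and thrown away with no decoding or parsing, so the time is close to pure IO.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadOnlyBenchmark {
    /** Reads the stream to the end, discarding the data; returns the bytes read. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    // Usage: java ReadOnlyBenchmark bigfile.csv
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long bytes;
        try (InputStream in = new FileInputStream(args[0])) {
            bytes = drain(in);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.2f s (%.1f MB/s)%n",
                bytes, seconds, bytes / 1e6 / seconds);
    }
}
```

A second run over the same file is often much faster because the operating system has cached it, so the first run (or a file larger than RAM) gives the more honest number. Comparing this time with the full program's time shows how much is left for parsing and writing.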
--
Roedy Green Canadian Mind Products
http://mindprod.com

Every compilable program in a sense works. The problem is with your unrealistic expectations on what it will do.

#9
02-12-2010, 02:05 AM
Alex
Re: Best Way to Process Large Text Files

On Feb 10, 9:07 pm, Arne Vajhøj <a...@vajhoej.dk> wrote:
> Huge files may still take time to read from the disk though.

A lot of time. I tried my skills in the Netflix $1,000,000 contest... on
my computer it took 15 minutes to read their entire data set (for example,
to do some calculation). After I compressed it into a zip archive,
reading and uncompressing it with the same calculation took only 3
minutes.
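
Reading straight out of a zip archive can be done with java.util.zip, roughly as sketched below. The class name ZipLines is invented, the archive is built in memory only so the example is self-contained, and counting lines stands in for "some calculation". Whether this beats reading the uncompressed file depends on how slow the disk is relative to the CPU.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipLines {
    /** Builds a one-entry zip archive in memory (a stand-in for a file on disk). */
    static byte[] zip(String entryName, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(text.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Streams the first entry of a zip archive, decompressing on the fly, and counts its lines. */
    static long countLines(InputStream archive) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new BufferedInputStream(archive))) {
            if (zis.getNextEntry() == null) {
                return 0;                      // archive has no entries
            }
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(zis, StandardCharsets.UTF_8));
            long lines = 0;
            while (br.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }
}
```

For a real file, pass a FileInputStream to countLines; nothing is unpacked to disk first.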

#10
02-12-2010, 09:39 PM
Daniel Pitts
Re: Best Way to Process Large Text Files

On 2/10/2010 3:28 AM, Michael Powe wrote:
> Hello,
>
> I am tasked with writing an application to process some large text
> files, i.e. > 1 GB. The input will be csv and the output will be in the
> format of an IIS web server log.
>
> I've done this sort of thing before. In the past, I've just
> brute-forced it, with a BufferedReader and BufferedWriter handling the
> input/output line by line.
>
> I have a little time to complete this project and I'd like to build
> something more efficient, that won't peg the CPU for an hour.
>
> My thought was to have a read thread and a write thread and create a
> buffer into which some amount of input would be written; and then, when
> a threshold was reached, the data would be written out.
>
> Is this a good idea? Are there better ways to manage this?
>
> And finally, I need pointers as to how I would create such a buffer.
> The threaded read/write part I can do.
>
> Thanks for any help.
>
> mp
>

Depending on how processor intensive the transformation is, you might
not gain anything from threading.

If you are using regex to parse, you may be better off optimizing your
regexes, or using hand-coded parsing instead. A naive regex which "works"
may have some performance problems. Using greedy matching where
appropriate is one way to improve performance.
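
As an illustration of hand-coded parsing next to a regex, here is a sketch that splits a line on commas both ways. It assumes the simplest possible CSV, with no quoted fields or embedded commas; the class and method names are invented. Which one is faster on real data is something to measure, not assume.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused, rather than recompiled for every line.
    private static final Pattern COMMA = Pattern.compile(",");

    /** Hand-coded split using indexOf; keeps empty fields, including a trailing one. */
    static List<String> splitByHand(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int comma;
        while ((comma = line.indexOf(',', start)) >= 0) {
            fields.add(line.substring(start, comma));
            start = comma + 1;
        }
        fields.add(line.substring(start));     // last field, possibly empty
        return fields;
    }

    /** Regex split; the limit of -1 keeps trailing empty fields. */
    static String[] splitByRegex(String line) {
        return COMMA.split(line, -1);
    }
}
```

Both methods return the same fields for the same line, so either can be dropped into a timing loop over the real input to see whether the regex is actually the bottleneck.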

--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

All times are GMT.