Re: UNIX datastep question
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Sent: Tuesday, June 27, 2006 6:24 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: UNIX datastep question
> Hi, again,
> Wow, I'm very grateful for all the assistance. I'll admit that I am a
> relatively new coder, and especially fresh to UNIX. You guys have given me
> lots to play with. I intend to try it all to find the best method for
> future work. I'm running my code overnight with the simple suggestions to
> see if it works. Then I plan to subset the data to make sure my search
> values are right. Then run the whole shebang. In my playtime I'm going to
> play around with everyone's suggestions.
> I'll be getting into bigger files later on, so I will need to know how to
> do these things. The most foreign thing was using a view, but it sounds
> useful, so I'll get to know it.
> Paul, I can do some of my work using by instead of class, but not all of
> it, like the logits.
> Thanks again to everyone. Don't take this as a hint to stop the
> suggestions. I just wanted to show my appreciation, even if I don't
> reply to everyone.
> -----Original Message-----
> From: Choate, Paul@DDS [mailto:choate@DDS.CA.GOV]
> Sent: Tuesday, June 27, 2006 7:58 PM
> To: SAS-L@LISTSERV.UGA.EDU; Jennifer Sabatier
> Cc: William W. Viergever
> Subject: RE: UNIX datastep question
> Hello again Jennifer -
> I most unreservedly agree with William's comments: compressing will slow
> things down, so you'd best test on a subset first and see if the space
> gain is worth the processing overhead. Subsetting a test dataset with
> "Options obs=1000000;" and then "Options obs=1000000 compress=yes;"
> might work nicely. The compression results will be posted in your log.
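> As a minimal sketch of that test (the library and dataset names below
> are placeholders, not from the original post - substitute your own):
>
> options obs=1000000;                   * process only the first 1e6 obs;
> data work.test_nocomp (compress=no);
>   set mylib.big;                       * "mylib.big" is hypothetical;
> run;
> data work.test_comp (compress=yes);
>   set mylib.big;
> run;
> options obs=max;                       * restore full processing;
>
> The log for the compressed copy will note how much the size decreased,
> which is the figure to weigh against the extra CPU time.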
> My work commonly involves moderate-sized datasets with many diagnoses.
> From experience, on each record two or three of your 15 will be
> non-blank, leaving the empty dozen or so choking the pipes with dead
> space. In such cases the compression will be significant.
> That said, the performance may very well be an issue, and as a bottle of
> wine may be at stake here in sunny Sactomatoe, I fully commend any and
> every idea my most wise colleague Mr. Viergever has to proffer!
> For example - testing this code showed a 28% performance reduction for a
> 68% space gain:
> options compress=no; *or compress=yes;
> data test;
>   array dx(15) $6;
>   do i = 1 to 1e6;
>     do j = 1 to 15;
>       if ranuni(123)<.1 then dx(j)='292.89';
>       else dx(j)='';
>       output;    * one obs per (i,j), so j can be used in "by" processing;
>     end;
>   end;
> run;
> If you really need the space then the compression may be worthwhile.
> Ian's sage comment re views also merits consideration.
> A compression alternative in Windows that may be available in your UNIX
> (or not - you get some homework) is that a library can be defined
> pointing to a compressed folder. In this case the OS does the
> compressing, and it will be better optimized than what SAS will do.
> Using the same code above without the compression option produced better
> compression (81%) for a smaller performance loss (10%).
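> A sketch of the Windows variant (the path is a placeholder, and the
> folder must already be OS-compressed, e.g. via the folder's Properties
> dialog):
>
> libname zipped 'C:\sas\compressed';    * points at a compressed folder;
> data zipped.test;
>   set work.test;                       * the OS compresses transparently;
> run;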
> One more comment to add - If your data is to be stored and used
> repeatedly, then if it isn't already, sort (or index) your data on a
> fairly granular variable. In your procedures process it using "by"
> groups, especially instead of class variables. This will speed your
> procedures and cause them to need less memory. Caveat - sorting
> requires about three times the amount of disk space as the original
> dataset. In my above example the second of the two below is 20% faster:
> proc freq data=test;
>   tables j*dx4;
> run;
> proc sort data=test;   * "by" processing needs data sorted by j;
>   by j;
> run;
> proc freq data=test;
>   by j;
>   tables dx4;
> run;
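> As an alternative to the physical sort, an index on j also permits the
> "by" processing - a sketch, and whether it beats the sort depends on
> how the file is accessed:
>
> proc datasets library=work nolist;
>   modify test;
>   index create j;                      * simple index on the by-variable;
> quit;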
> Paul Choate
> DDS Data Extraction
> (916) 654-2160
> > -----Original Message-----
> > From: William W. Viergever [mailto:email@example.com]
> > Sent: Tuesday, June 27, 2006 3:24 PM
> > To: Choate, Paul@DDS; SAS-L@LISTSERV.UGA.EDU
> > Subject: Re: UNIX datastep question
> > jennifer:
> > i'll agree w/ my esteemed, fellow sacramentan, mr. choate - make your
> > flags character; you can do the math: 14B * 8 * 7 to get an idea why
> > your file grew so large
> > also: yes - use formats
> > as for compression, i'm not enough of an expert to offer conclusive
> > evidence as to why NOT to do that ... other than to presume you'll
> > chew up mega CPU cycles uncompressing that big chunk as you go
> > through the obs
> > however, i can say that i play w/ files larger than yours, on a PC
> > running Windoze XP, and have never needed to use compression. e.g.,
> > 12/01/2005 01:08 PM 30,308,729,856 ndc_35_file_claims.sas7bdat
> > 07/05/2005 02:41 PM 35,116,319,744 35_file_yrmonth_load.sas7bdat
> > and, although in many files, i've got another directory that i run
> > through regularly:
> > 137 File(s) 223,795,044,352 bytes
> > so, my advice: try paul's suggestions, sans the compression, first
> > ciao'
Out of curiosity, is your 14GB file already compressed? The reason I ask
is that you say this file has 14 billion people (records?). That would mean
only 1 byte per record unless the file was compressed, and clearly your
records contain many more bytes than that.
You have received much good advice already about saving space and improving
performance. I'll suggest a couple more things for you to consider.
It looks like you are creating your new variables and adding them to ALL the
variables that are in the original file. You might want to keep only the
variables that you are going to analyze. Your resulting file will be much
smaller. If you do that, I second Ian's suggestion to add an ID variable.
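A sketch of that idea (the library, dataset, and variable names here are
placeholders, since the real names weren't posted):

data mylib.analysis;
  set mylib.big (keep=dx1-dx15 age sex);  * keep only what you analyze;
  id = _n_;                               * sequential ID to link back;
run;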
It was suggested that you might want to sort the file if you are going to be
using it repeatedly. That is a good idea. I often sort AND index large
files. Indexes work great if you need rapid access to small numbers of
records from a file. However, if you need to read a whole file in some
order it will be best to have the file sorted in that order. Using an index
can really increase I/O if the physical order of the file is different from
the index ordering. If you can figure out how to do what you want without
sorting, then by all means avoid the sort. This brings me back to my
question about file compression. As was suggested you will need about 3
times the disk space to sort your file. If the file is compressed you will
need much more than three times the compressed file size.
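A sketch of the sort-plus-index approach (names are again placeholders):

proc sort data=mylib.analysis;
  by id;
run;

proc datasets library=mylib nolist;
  modify analysis;
  index create id;
quit;

* the index pays off for small, targeted retrievals like this;
data subset;
  set mylib.analysis;
  where id in (12345, 67890);   * placeholder key values;
run;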
You seem to have settled on analyzing one variable at a time. I am not
familiar with the dataset that you are using, but is that a good idea? Is
there some structure imposed on which diagnoses go in which field? Aren't
you losing some information if for one record a hyperlipidemia diagnosis is
in the first field and on another record there is a hyperlipidemia diagnosis
in the fifteenth field? Just a thought.
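If the field position carries no meaning, one way to avoid losing those
cases is to scan all fifteen fields per record. A sketch, using '272.4'
(the ICD-9 code for unspecified hyperlipidemia) purely as an
illustration:

data flagged;
  set mylib.analysis;
  array dx(15) dx1-dx15;
  hyperlip = 0;
  do j = 1 to 15;
    if dx(j) = '272.4' then hyperlip = 1;  * a hit in any field counts;
  end;
  drop j;
run;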
Hope this is helpful, and best of luck with your file manipulation and
analysis.
Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204