Go Back   Rhinocerus > Newsgroup > Newsgroup comp.soft-sys.sas

Reply
 
Thread Tools Display Modes
  #1 (permalink)  
Old 06-05-2008, 04:20 AM
Paul Dorfman
Guest
 
Posts: n/a
Default Re: Question about efficient data extraction

Dan,

In this kind of scenario, where N(small)/N(large) is very small, any method
reading the large file as infrequently as possible will win compared to any
method required to read the whole file. From this standpoint, if you need to
run the cohort against the claims just a few times, binary search is the
answer because it uses the simple fact that an ordered file is already
intrinsically self-indexed.

To test, I have expanded your claims file 10-fold (in an attempt to
eliminate small background noise effects; irrelevant run stats omitted; all
run under AIX):

1 data cohort;
2 do _n_ = 1 to 1e3;
3 id = ceil(1e7*uniform(16437));
4 output;
5 end;
6 run;
NOTE: The data set USER.COHORT has 1000 observations and 1 variables.

7 run;
8 data claims ;
9 do id = 1 to 1e7;
10 _n_ = ceil(5*uniform(77219));
11 do claim = 1 to _n_;
12 value = uniform(0);
13 output;
14 end;
15 end;
16 run;
NOTE: The data set USER.CLAIMS has 30005555 observations and 3
variables.

Now the binary search (duly expanded to account for the collection of
duplicates in CLAIMS file) results in:

66 data testbin1 ( drop = _: ) ;
67 set cohort ;
68 _l = 1 ;
69 _h = n ;
70 do until ( _l > _h | _match ) ;
71 m = floor ((_l + _h) * .5) ;
72 set claims (rename = (id = _id)) point = m nobs = n ;
73 if id < _id then _h = m - 1 ;
74 else if id > _id then _l = m + 1 ;
75 else _match = 1 ;
76 end ;
77 if not _match then delete ;
78 _p = m ;
79 do m = _p - 0 by -1 while ( m >= 1 ) ;
80 set claims (rename = (id = _id)) point = m ;
81 if _id < id then leave ;
82 output ;
83 end ;
84 do m = _p + 1 by +1 while ( m <= n ) ;
85 set claims (rename = (id = _id)) point = m ;
86 if _id > id then leave ;
87 output ;
88 end ;
89 run ;
NOTE: There were 1000 observations read from the data set USER.COHORT.
NOTE: The data set USER.TESTBIN1 has 3072 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 2.44 seconds

If you need to run it once, or a few times, or do not want to do anything to
the large file (say, index it), or cannot do it (say, have insufficient
permission level), then the above is your best option. (Actually,
interpolation search would still run faster - in this particular case, much
faster, but if the key values distribution in the large file is skewed, it
can turn to be a bad flop).

If you have the liberty to index the large file beforehand and let people
exploit its functionality via the SQL optimizer or SET with KEY= option, not
really caring about the time needed to build the index, it is difficult to
outperform:

56 proc sql ;
57 create index id on claims (id) ;
NOTE: Simple index id has been defined.
NOTE: SQL Statement used (Total process time):
real time 1:06.82
cpu time 50.01 seconds

58 create table test4 as
59 select b.*
60 from cohort as a left join
61 claims as b
62 on a.id = b.id
63 order by id, claim
64 ;
INFO: Index id of SQL table USER.CLAIMS (alias = B) selected for SQL
WHERE clause (outer join) optimization.
NOTE: Table USER.TEST4 created, with 3072 rows and 3 columns.
NOTE: SQL Statement used (Total process time):
real time 0.67 seconds

You have also seen yourself how well SET with KEY= option works, so I am not
repeating it here. Of course, the mechanics behind it are the same as behind
the join above: (1) search the index (2) if ID from COHORT is found in
CLAIMS, return their row IDs (3) use the row ids to read their records from
CLAIMS.

Now why does this scheme works appreciably faster than the binary search
against the disk-resident file? Likely because searching the index where IDs
are unique (with multiple row ids hanging on to them) is faster than binary
searching the whole file, all the more that every time M points to a
different page in the file, the current page has to be suspended and a new
page - brought into the buffer, both being quite costly. This can be
alleviated by moving the file to memory using SASFILE, but methinks that
here we assume its memory footprint prohibitive.

On the other hand, if you have enough latitude to create the index
(practically, a separate file), there is no reason why the same cannot be
done "by hand" to optimize the binary search process. All we need to do is
to store unique IDs and their row number ranges in a separate SAS file; then
use it in conjunction with the sorted CLAIMS file basically in the same
manner it is done via SAS indexing. It makes all the more sense that the
file is already sorted, hence creating the index file requires no more than
a single read:

90 data index (keep = id l h) ;
91 l = sum (h, 1) ;
92 do h = l by 1 until (last.id) ;
93 set claims (sortedby = id) ;
94 by id ;
95 end ;
96 run ;
NOTE: There were 30005555 observations read from the data set
USER.CLAIMS.
NOTE: The data set USER.INDEX has 10000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 35.35 seconds
cpu time 24.95 seconds

97 data testbin2 ( drop = _: ) ;
98 set cohort ;
99 _l = 1 ;
00 _h = n ;
101 do until ( _l > _h | _match ) ;
102 m = floor ((_l + _h) * .5) ;
103 set index (rename = (id = _id)) point = m nobs = n ;
104 if id < _id then _h = m - 1 ;
105 else if id > _id then _l = m + 1 ;
106 else _match = 1 ;
107 end ;
108 if _match then do m = l to h ;
109 set claims (drop = id) point = m ;
110 output ;
111 end ;
112 run ;
NOTE: There were 1000 observations read from the data set USER.COHORT.
NOTE: The data set USER.TESTBIN2 has 3072 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.52 seconds

It appears that the reasons why binary search was hosed down a bit can have
been guessed right. Although such hand-crafted index outperforms the SAS
index merely by a hair (and of course the reverse can be true in a different
test under different machine circumstances), building of the custom index on
a sorted file is about twice faster.

Incidentally, these matters have been discussed rather profusely over time
in SAS-L. Just look binary or interpolation search up in the archives, and
you will have more than enough pre-bed-time reading to do.

Kind regards
------------
Paul Dorfman
Jax, FL
------------

On Wed, 4 Jun 2008 12:17:53 -0700, Nordlund, Dan (DSHS/RDA)
<NordlDJ@DSHS.WA.GOV> wrote:

>I have been helping new SAS users in my office get up to speed on accessing

a large SAS data repository. In the process I have discovered that at least
one of the approaches I have used historically is very inefficient (simle
data step merge). I thought I would describe a common type of task we do,
"benchmark" some approaches, and ask for ideas for further improvement.
>
>We do a lot of analyses of Medicaid claims data for various subpopulations

of Medicaid eligible clients. We often have a file with client IDs and we
wish to extract all claims for those clients for further processing. The
file of IDs is in sorted order and the claims data are in ID order.
>
>Here are some approaches that I have tried when extracting claims data (I

will create some sample data in case some of you would like to play along at
home :-).
>
>**--create some sample data--**;
>data cohort;
> do _n_ = 1 to 1000;
> id = ceil(1000000*uniform(16437));
> output;
> end;
>run;
>proc sort nodupkey data=cohort;
> by id;
>run;
>data claims;
> do id = 1 to 1000000;
> _n_ = ceil(5*uniform(77219));
> do claim = 1 to _n_;
> value = uniform(0);
> output;
> end;
> end;
>run;
>
>**--TEST1: data step merge with subsetting IF--**;
>data test1;
> merge cohort(in=in_cohort)
> claims;
> by id;
> if in_cohort;
>run;
>
>
>NOTE: There were 1000 observations read from the data set WORK.COHORT.
>NOTE: There were 3001966 observations read from the data set WORK.CLAIMS.
>NOTE: The data set WORK.TEST1 has 2943 observations and 3 variables.
>NOTE: DATA statement used (Total process time):
> real time 1.54 seconds
> cpu time 1.54 seconds
>
>In TEST1, the first thing that was a little surprising to me was that when

the merge runs out of records in the cohort file, it continues to read the
rest of the records from the claims file (this is true even if there is only
one ID in the cohort file and it is the first ID in the claims file).
>
>Not having grown up with sql, I haven't used it a lot. Below I present

some SQL approaches to this data extraction problem. First, TEST2 used a
LEFT JOIN providing no information about the structure of the data. This
ran slower than the data step merge. In TEST3, I added the SORTEDBY= option
on the claims file and the extract was almost an order of magnitude faster.
In TEST4, I indexed the claims file on ID and got another order of
magnitude speed-up. While in TEST5 I tried the data step merge again and
the presence of an index seemed to slow it down (Indexes are only used by
WHERE clauses in the data step?). All the times below are typical of
multiple runs that I did, and I see similar results using our real data.
>
>So in this kind of task SQL seems to be the clear winner. One question I

have is whether there is a way to use a data step and achieve the same
performance as the SQL approaches? Some kind of key index approach maybe?
>
>Are there other approaches that I should be considering when doing this

kind of data extaction?
>
>Thanks for any feedback, and hopefully others will find something useful in

the discussion.
>
>Dan
>
>**--TEST2: sql join, no index and no info on sort order--**;
>proc sql ;
> create table test2 as
> select b.*
> from cohort as a left join
> claims as b
> on a.id = b.id
> order by id, claim
> ;
>quit;
>NOTE: Table WORK.TEST2 created, with 2943 rows and 3 columns.
>
>NOTE: PROCEDURE SQL used (Total process time):
> real time 7.23 seconds
> cpu time 5.79 seconds
>
>**--TEST3: sql join with sortedby option on claims--**;proc sql ;
> create table test3 as
> select b.*
> from cohort as a left join
> claims(sortedby=id) as b
> on a.id = b.id
> order by id, claim
> ;
>quit;
>NOTE: Table WORK.TEST3 created, with 2943 rows and 3 columns.
>
>NOTE: PROCEDURE SQL used (Total process time):
> real time 0.89 seconds
> cpu time 0.87 seconds
>
>
>If I index the claims file on ID, then the extraction is very speady.
>
>**--create index id on claims--**;
>proc sql;
> create index id on claims(id);
>quit;
>
>**--TEST4: sql join with index--**;
>proc sql ;
> create table test4 as
> select b.*
> from cohort as a left join
> claims as b
> on a.id = b.id
> order by id, claim
> ;
>quit;
>NOTE: Table WORK.TEST4 created, with 2943 rows and 3 columns.
>
>NOTE: PROCEDURE SQL used (Total process time):
> real time 0.09 seconds
> cpu time 0.06 seconds
>
>
>**--TEST5: data step merge - presence of index seems to slow things down--**;
>data test5;
> merge cohort(in=in_cohort)
> claims;
> by id;
> if in_cohort;
>run;
>
>NOTE: There were 1000 observations read from the data set WORK.COHORT.
>NOTE: There were 3001966 observations read from the data set WORK.CLAIMS.
>NOTE: The data set WORK.TEST6 has 2943 observations and 3 variables.
>NOTE: DATA statement used (Total process time):
> real time 4.18 seconds
> cpu time 4.18 seconds
>
>
>Daniel J. Nordlund
>Washington State Department of Social and Health Services
>Planning, Performance, and Accountability
>Research and Data Analysis Division
>Olympia, WA 98504-5204

Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Question about efficient data extraction Muthia Kachirayan Newsgroup comp.soft-sys.sas 0 06-05-2008 12:13 AM
Re: Basic question, I hope...SAS XML too Don Henderson Newsgroup comp.soft-sys.sas 1 07-14-2007 07:44 AM
Re: exist function question Joe Whitehurst Newsgroup comp.soft-sys.sas 1 09-04-2006 09:21 AM
Re: Views and passes (was RE: Output last record of fantom by Paul M. Dorfman Newsgroup comp.soft-sys.sas 0 07-14-2005 04:40 AM
Re: PROC SQL vs DATA STEP Question Sigurd Hermansen Newsgroup comp.soft-sys.sas 0 01-07-2005 06:21 PM



All times are GMT. The time now is 01:21 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.