|
|||
|
Dan,
In this kind of scenario, where N(small)/N(large) is very small, any method reading the large file as infrequently as possible will win compared to any method required to read the whole file. From this standpoint, if you need to run the cohort against the claims just a few times, binary search is the answer because it uses the simple fact that an ordered file is already intrinsically self-indexed. To test, I have expanded your claims file 10-fold (in an attempt to eliminate small background noise effects; irrelevant run stats omitted; all run under AIX): 1 data cohort; 2 do _n_ = 1 to 1e3; 3 id = ceil(1e7*uniform(16437)); 4 output; 5 end; 6 run; NOTE: The data set USER.COHORT has 1000 observations and 1 variables. 7 run; 8 data claims ; 9 do id = 1 to 1e7; 10 _n_ = ceil(5*uniform(77219)); 11 do claim = 1 to _n_; 12 value = uniform(0); 13 output; 14 end; 15 end; 16 run; NOTE: The data set USER.CLAIMS has 30005555 observations and 3 variables. Now the binary search (duly expanded to account for the collection of duplicates in CLAIMS file) results in: 66 data testbin1 ( drop = _: ) ; 67 set cohort ; 68 _l = 1 ; 69 _h = n ; 70 do until ( _l > _h | _match ) ; 71 m = floor ((_l + _h) * .5) ; 72 set claims (rename = (id = _id)) point = m nobs = n ; 73 if id < _id then _h = m - 1 ; 74 else if id > _id then _l = m + 1 ; 75 else _match = 1 ; 76 end ; 77 if not _match then delete ; 78 _p = m ; 79 do m = _p - 0 by -1 while ( m >= 1 ) ; 80 set claims (rename = (id = _id)) point = m ; 81 if _id < id then leave ; 82 output ; 83 end ; 84 do m = _p + 1 by +1 while ( m <= n ) ; 85 set claims (rename = (id = _id)) point = m ; 86 if _id > id then leave ; 87 output ; 88 end ; 89 run ; NOTE: There were 1000 observations read from the data set USER.COHORT. NOTE: The data set USER.TESTBIN1 has 3072 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 2.44 seconds If you need to run it once, or a few times, or do not want to do anything to the large file (say, index it), or cannot do it (say, have insufficient permission level), then the above is your best option. (Actually, interpolation search would still run faster - in this particular case, much faster, but if the key values distribution in the large file is skewed, it can turn to be a bad flop). If you have the liberty to index the large file beforehand and let people exploit its functionality via the SQL optimizer or SET with KEY= option, not really caring about the time needed to build the index, it is difficult to outperform: 56 proc sql ; 57 create index id on claims (id) ; NOTE: Simple index id has been defined. NOTE: SQL Statement used (Total process time): real time 1:06.82 cpu time 50.01 seconds 58 create table test4 as 59 select b.* 60 from cohort as a left join 61 claims as b 62 on a.id = b.id 63 order by id, claim 64 ; INFO: Index id of SQL table USER.CLAIMS (alias = B) selected for SQL WHERE clause (outer join) optimization. NOTE: Table USER.TEST4 created, with 3072 rows and 3 columns. NOTE: SQL Statement used (Total process time): real time 0.67 seconds You have also seen yourself how well SET with KEY= option works, so I am not repeating it here. Of course, the mechanics behind it are the same as behind the join above: (1) search the index (2) if ID from COHORT is found in CLAIMS, return their row IDs (3) use the row ids to read their records from CLAIMS. Now why does this scheme works appreciably faster than the binary search against the disk-resident file? Likely because searching the index where IDs are unique (with multiple row ids hanging on to them) is faster than binary searching the whole file, all the more that every time M points to a different page in the file, the current page has to be suspended and a new page - brought into the buffer, both being quite costly. This can be alleviated by moving the file to memory using SASFILE, but methinks that here we assume its memory footprint prohibitive. On the other hand, if you have enough latitude to create the index (practically, a separate file), there is no reason why the same cannot be done "by hand" to optimize the binary search process. All we need to do is to store unique IDs and their row number ranges in a separate SAS file; then use it in conjunction with the sorted CLAIMS file basically in the same manner it is done via SAS indexing. It makes all the more sense that the file is already sorted, hence creating the index file requires no more than a single read: 90 data index (keep = id l h) ; 91 l = sum (h, 1) ; 92 do h = l by 1 until (last.id) ; 93 set claims (sortedby = id) ; 94 by id ; 95 end ; 96 run ; NOTE: There were 30005555 observations read from the data set USER.CLAIMS. NOTE: The data set USER.INDEX has 10000000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 35.35 seconds cpu time 24.95 seconds 97 data testbin2 ( drop = _: ) ; 98 set cohort ; 99 _l = 1 ; 00 _h = n ; 101 do until ( _l > _h | _match ) ; 102 m = floor ((_l + _h) * .5) ; 103 set index (rename = (id = _id)) point = m nobs = n ; 104 if id < _id then _h = m - 1 ; 105 else if id > _id then _l = m + 1 ; 106 else _match = 1 ; 107 end ; 108 if _match then do m = l to h ; 109 set claims (drop = id) point = m ; 110 output ; 111 end ; 112 run ; NOTE: There were 1000 observations read from the data set USER.COHORT. NOTE: The data set USER.TESTBIN2 has 3072 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.52 seconds It appears that the reasons why binary search was hosed down a bit can have been guessed right. Although such hand-crafted index outperforms the SAS index merely by a hair (and of course the reverse can be true in a different test under different machine circumstances), building of the custom index on a sorted file is about twice faster. Incidentally, these matters have been discussed rather profusely over time in SAS-L. Just look binary or interpolation search up in the archives, and you will have more than enough pre-bed-time reading to do. Kind regards ------------ Paul Dorfman Jax, FL ------------ On Wed, 4 Jun 2008 12:17:53 -0700, Nordlund, Dan (DSHS/RDA) <NordlDJ@DSHS.WA.GOV> wrote: >I have been helping new SAS users in my office get up to speed on accessing a large SAS data repository. In the process I have discovered that at least one of the approaches I have used historically is very inefficient (simle data step merge). I thought I would describe a common type of task we do, "benchmark" some approaches, and ask for ideas for further improvement. > >We do a lot of analyses of Medicaid claims data for various subpopulations of Medicaid eligible clients. We often have a file with client IDs and we wish to extract all claims for those clients for further processing. The file of IDs is in sorted order and the claims data are in ID order. > >Here are some approaches that I have tried when extracting claims data (I will create some sample data in case some of you would like to play along at home :-). > >**--create some sample data--**; >data cohort; > do _n_ = 1 to 1000; > id = ceil(1000000*uniform(16437)); > output; > end; >run; >proc sort nodupkey data=cohort; > by id; >run; >data claims; > do id = 1 to 1000000; > _n_ = ceil(5*uniform(77219)); > do claim = 1 to _n_; > value = uniform(0); > output; > end; > end; >run; > >**--TEST1: data step merge with subsetting IF--**; >data test1; > merge cohort(in=in_cohort) > claims; > by id; > if in_cohort; >run; > > >NOTE: There were 1000 observations read from the data set WORK.COHORT. >NOTE: There were 3001966 observations read from the data set WORK.CLAIMS. >NOTE: The data set WORK.TEST1 has 2943 observations and 3 variables. >NOTE: DATA statement used (Total process time): > real time 1.54 seconds > cpu time 1.54 seconds > >In TEST1, the first thing that was a little surprising to me was that when the merge runs out of records in the cohort file, it continues to read the rest of the records from the claims file (this is true even if there is only one ID in the cohort file and it is the first ID in the claims file). > >Not having grown up with sql, I haven't used it a lot. Below I present some SQL approaches to this data extraction problem. First, TEST2 used a LEFT JOIN providing no information about the structure of the data. This ran slower than the data step merge. In TEST3, I added the SORTEDBY= option on the claims file and the extract was almost an order of magnitude faster. In TEST4, I indexed the claims file on ID and got another order of magnitude speed-up. While in TEST5 I tried the data step merge again and the presence of an index seemed to slow it down (Indexes are only used by WHERE clauses in the data step?). All the times below are typical of multiple runs that I did, and I see similar results using our real data. > >So in this kind of task SQL seems to be the clear winner. One question I have is whether there is a way to use a data step and achieve the same performance as the SQL approaches? Some kind of key index approach maybe? > >Are there other approaches that I should be considering when doing this kind of data extaction? > >Thanks for any feedback, and hopefully others will find something useful in the discussion. > >Dan > >**--TEST2: sql join, no index and no info on sort order--**; >proc sql ; > create table test2 as > select b.* > from cohort as a left join > claims as b > on a.id = b.id > order by id, claim > ; >quit; >NOTE: Table WORK.TEST2 created, with 2943 rows and 3 columns. > >NOTE: PROCEDURE SQL used (Total process time): > real time 7.23 seconds > cpu time 5.79 seconds > >**--TEST3: sql join with sortedby option on claims--**;proc sql ; > create table test3 as > select b.* > from cohort as a left join > claims(sortedby=id) as b > on a.id = b.id > order by id, claim > ; >quit; >NOTE: Table WORK.TEST3 created, with 2943 rows and 3 columns. > >NOTE: PROCEDURE SQL used (Total process time): > real time 0.89 seconds > cpu time 0.87 seconds > > >If I index the claims file on ID, then the extraction is very speady. > >**--create index id on claims--**; >proc sql; > create index id on claims(id); >quit; > >**--TEST4: sql join with index--**; >proc sql ; > create table test4 as > select b.* > from cohort as a left join > claims as b > on a.id = b.id > order by id, claim > ; >quit; >NOTE: Table WORK.TEST4 created, with 2943 rows and 3 columns. > >NOTE: PROCEDURE SQL used (Total process time): > real time 0.09 seconds > cpu time 0.06 seconds > > >**--TEST5: data step merge - presence of index seems to slow things down--**; >data test5; > merge cohort(in=in_cohort) > claims; > by id; > if in_cohort; >run; > >NOTE: There were 1000 observations read from the data set WORK.COHORT. >NOTE: There were 3001966 observations read from the data set WORK.CLAIMS. >NOTE: The data set WORK.TEST6 has 2943 observations and 3 variables. >NOTE: DATA statement used (Total process time): > real time 4.18 seconds > cpu time 4.18 seconds > > >Daniel J. Nordlund >Washington State Department of Social and Health Services >Planning, Performance, and Accountability >Research and Data Analysis Division >Olympia, WA 98504-5204 |
|
|
||||
|
||||
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Re: Question about efficient data extraction | Muthia Kachirayan | Newsgroup comp.soft-sys.sas | 0 | 06-05-2008 12:13 AM |
| Re: Basic question, I hope...SAS XML too | Don Henderson | Newsgroup comp.soft-sys.sas | 1 | 07-14-2007 07:44 AM |
| Re: exist function question | Joe Whitehurst | Newsgroup comp.soft-sys.sas | 1 | 09-04-2006 09:21 AM |
| Re: Views and passes (was RE: Output last record of fantom by | Paul M. Dorfman | Newsgroup comp.soft-sys.sas | 0 | 07-14-2005 04:40 AM |
| Re: PROC SQL vs DATA STEP Question | Sigurd Hermansen | Newsgroup comp.soft-sys.sas | 0 | 01-07-2005 06:21 PM |