|
|||
|
Steve:
I've been on the sidelines working on pressing tasks. Ordinarily I would have jumped in an suggested a SQL solution.... Not this time.... I would go with a hash index, or even a bit array if you have enough free memory to hold the ID's of the patients in each update: - put the ID's for an update in a hash index / bit array as prescribed in the Dorfman Letters to SAS-L (using a temporary array in a Data step); - UPDATE the flag in the master file IF the ID in a row gets a hit on the index/array. - append the update dataset to the master dataset. It should take less than a minute to build the index. Scanning a 42M row dataset should take no more than a few minutes elapsed time on a reasonably quick machine. The index/array search should take a negligible amount of time. The whole process should scale up linearly. I don't particularly like the design of your database. I prefer to see separate tables for patient data and for events related to patients. As the database stands, the flag will make it easy to select the last update to the database for a patient. Sig -----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of James, Steve Sent: Tuesday, January 04, 2005 3:10 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Big Table / Little Table Update Mike, The master data set has about 42 million records in it. I don't know how representative the numbers from this run are since it's just test code and the transaction table is actually created from the master (see log below). 150 records per patient sounds high to me. As I mentioned before I don't know the data that well. I'll have to find out more. I'm trying to run the program again and make some changes to bufsize and ibufsize that Michael Raithel suggested. It'll be interesting to see if it makes a difference. Steve 8 data Master (index=(key)) Trans ; 9 set j.vaDetailsPre04; 10 if datepart(storedate) ge mdy(12,31,2004) then output Trans ; 11 else output Master ; 12 run; NOTE: There were 42179057 observations read from the data set J.VADETAILSPRE04. NOTE: The data set WORK.MASTER has 42156526 observations and 29 variables. NOTE: The data set WORK.TRANS has 22531 observations and 29 variables. NOTE: DATA statement used (Total process time): real time 1:06:05.36 cpu time 17:24.85 -----Original Message----- From: Mike Rhoads [mailto:RHOADSM1@WESTAT.com] Sent: Monday, January 03, 2005 5:39 PM To: James, Steve; SAS-L@LISTSERV.UGA.EDU Subject: RE: Big Table / Little Table Update Steve, Glad you're making (some) progress! A few thoughts / suggestions / questions follow. While you're still testing things out, I might use a separate PROC SQL step for each PROC SQL statement. The extra overhead should be minimal, and that would give you a better idea of how each statement is performing time-wise. You can remove the extra PROC SQLs when / if you put the code into production. If I'm understanding the log, you have 22531 transactions for 5323 patients. Your final DATA step reports that 780,472 records were updated, which implies that your master file averages around 150 records per patient -- does that sound correct? Do you know about how many patients, and records, are in your master? It would be interesting to know approximately what percentage get updated each day. Mike Rhoads Westat RhoadsM1@Westat.com -----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of James, Steve Sent: Monday, January 03, 2005 5:00 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Big Table / Little Table Update Thanks for the helpful tips last week about my problem concerning updating a big table with duplicate key values from a much smaller transaction table. Several people gave me some good advice as well as code. I'm testing a solution from Mike Rhoads (see log below) and I was able to get the CPU time down to under 4 minutes (from 42 minutes). However the clock time remains quite long. For the run today it was over 2 hours just for the update. When I ran it last week (after most people had left for the holiday), it ran in about 45 minutes of clock time with about the same CPU time as before. I can't understand why there's such a discrepancy: 4 minutes vs 2 hours? Is it I/O? How can I convince my co-worker that while his solution takes 42 minutes of CPU time but roughly the same amount of real time that he should switch methods to the one taking less CPU time? We're running on a Sunfire 6800 w/ 8 processors running SAS 9.1 under Solaris 64 bit. Steve James Centers for Disease Control and Prevention sjames@cdc.gov 21 /*---------------------------------------------------------------------- --- 22 Append Trans data to Master. 23 ------------------------------------------------------------------------ -*/ 24 proc sql; 25 insert into Master select * from Trans 26 ; NOTE: 22531 rows were inserted into WORK.MASTER. 27 28 /*---------------------------------------------------------------------- --- 29 Create a table containing the Keys that are in Master that also are in 30 Trans. Since you have already appended Trans to Master, there should be 31 no IDs in Trans that are not in Master. 32 ------------------------------------------------------------------------ -*/ 33 create table KeysInTrans as 34 select key, max(storedate) as date 35 from Trans 36 group by key 37 ; NOTE: Table WORK.KEYSINTRANS created, with 5323 rows and 2 columns. 38 39 /*---------------------------------------------------------------------- -- 40 Get the most recent visit date for each Key. 41 ------------------------------------------------------------------------ --*/ 42 CREATE TABLE MostRecentVisit AS 43 SELECT T.Key, MAX(StoreDate) AS Date 44 FROM KeysInTrans AS T, Master as M 45 where T.Key = M.Key 46 GROUP BY T.Key ; NOTE: Table WORK.MOSTRECENTVISIT created, with 5323 rows and 2 columns. 47 48 quit; NOTE: PROCEDURE SQL used (Total process time): real time 1:39:35.63 cpu time 4:29.72 49 50 /*---------------------------------------------------------------------- --- 51 Now read in record in Trans and update the corresponding records in Master. 52 ------------------------------------------------------------------------ -*/ 3 The SAS System 11:10 Monday, January 3, 2005 53 54 data Master ; 55 set MostRecentVisit end=lastob ; 56 do until (_iorc_ = %sysrc(_dsenom)); 57 modify Master key=key ; 58 select (_iorc_); 59 when (%sysrc(_sok)) do; 60 if storedate eq date 61 then cdc_avrflag=1 ; 62 else cdc_avrflag=0 ; 63 replace Master ; 64 records_updated + 1 ; 65 end; /* end of where sysrc=_sok */ 66 when (%sysrc(_dsenom)) do; 67 _error_ = 0; 68 end ; 69 otherwise; 70 end; /* end of select (_iorc_) statement */ 71 end; /* end of do until () statement */ 72 if lastob then put 'the number of records updated were: ' records_updated comma9. ; 73 run; the number of records updated were: 780,472 NOTE: There were 5323 observations read from the data set WORK.MOSTRECENTVISIT. NOTE: The data set WORK.MASTER has been updated. There were 780472 observations rewritten, 0 observations added and 0 observations deleted. NOTE: DATA statement used (Total process time): real time 2:21:58.26 cpu time 3:52.51 |
|
|
||||
|
||||
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Re: Update a table on itself | Sigurd Hermansen | Newsgroup comp.soft-sys.sas | 0 | 12-11-2006 02:10 PM |
| Re: Big Table / Little Table Update | Chang Y. Chung | Newsgroup comp.soft-sys.sas | 0 | 12-29-2004 04:13 AM |
| Re: Big Table / Little Table Update | nospam@HOWLES.COM (Howard Schreier | Newsgroup comp.soft-sys.sas | 0 | 12-29-2004 01:38 AM |
| Re: Big Table / Little Table Update | James, Steve | Newsgroup comp.soft-sys.sas | 0 | 12-28-2004 10:15 PM |
| Re: Big Table / Little Table Update | Mike Rhoads | Newsgroup comp.soft-sys.sas | 0 | 12-28-2004 07:40 PM |