Go Back   Rhinocerus > Newsgroup > Newsgroup comp.soft-sys.sas

Reply
 
Thread Tools Display Modes
  #1 (permalink)  
Old 07-14-2005, 04:40 AM
Paul M. Dorfman
Guest
 
Posts: n/a
Default Re: Views and passes (was RE: Output last record of fantom by

Sig,

Interesting idea. I took it a bit further going on to code:

data out_hv / view = out_hv ;
dcl hash q (hashexp:16, ordered:'y') ;
dcl hiter qi ('q') ;
q.definekey ('id', 'yr') ;
q.definedata ('id', 'yr', 'date') ;
q.definedone () ;
do until ( eof ) ;
set have end = eof ;
yr = year (date) ;
q.replace () ;
end ;
do _iorc_ = qi.next() by 0 while ( _iorc_ = 0 ) ;
output ;
_iorc_ = qi.next() ;
end ;
stop ;
run ;

proc sql _method ;
create table out as
select t1.*
from have (sortedby = id date) as t1 inner join
out_hv (sortedby = id yr ) as t2
on t1.ID = t2.ID
and t1.date = t2.date
;
quit ;

This compares OK within the margin of SAS log stats' time measurement error with
the writing of OUT_H using the .OUTPUT() method and consuming OUT_H by SQL.
Though the Data step alone with .OUTPUT() get the job done, it is nice to know
that there is no problem generating a Data step view even if the Data step
contains a hash object.

Kind regards
----------------
Paul M. Dorfman
Jacksonville, FL
----------------


> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Sigurd Hermansen
> Sent: Wednesday, July 13, 2005 6:00 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Views and passes (was RE: Output last record of
> fantom by group)
>
> Roy (and others entangled in this thread):
> So far as I know a view reads and writes data just like a
> Data step or query, but only after being invoked by another
> Data step or query. I've expanded Paul's examples to
> illustrate the time required for each process (in real time seconds):
>
> Data step 1 Data step 2 SQL Query
> sort 8:14 8:14
> hash order 12.50
> select 21.40 13.53 8.00
> ------- -------- -------
> 29.54 21.67 20.50
>
> I've fattened up the rows of data by adding a long text
> variable to each row. In practical situations, we tend to
> select more than key values.
>
> A sort orders the dataset prior to the Data steps that rely
> on ordering properties. A slightly modified hash routine
> selects and orders a list of ID,date pairs that the SQL query
> uses to select required records. In general the more rows,
> the faster the hash routine will work relative to a physical
> sort. The Data step that groups by date clearly takes
> substantially longer than either the Data step or SQL query
> that groups by year.
>
> I would expect to see some variation in the difference
> between the Data step 2 and SQL query times. Taking time
> required for data ordering into account, a Data step
> processing a Data step view and the SQL query work equally fast.
>
> For this type of data subsetting problem, the idea of using a
> hash selector to select a subset of key values, then using
> the subset of key values to select rows from a dataset, has
> interesting implications for SAS SQL query optimization. I
> wonder about hashing pointers to observations in SAS datasets ....
>
> I've posted relevant sections of the log below:
>
> 294 data have ;
> 295 do date = -1e4 to 1e4 by 1 ;
> 296 do id = 1 to 1e2 ;
> 297 stuff=repeat('X',999);
> 298 output ;
> 299 end ;
> 300 end ;
> 301 run ;
>
> INFO: Character variables have defaulted to a length of 200
> at the places given by:
> (Line)Column). Truncation may result.
> 297:12 stuff
> NOTE: The data set WORK.HAVE has 2000100 observations and 3 variables.
> NOTE: DATA statement used (Total process time):
> real time 54.25 seconds
> cpu time 13.67 seconds
>
>
> 302 proc sort data=have;
> 303 by id date;
> 304 run;
>
> NOTE: There were 2000100 observations read from the data set
> WORK.HAVE.
> NOTE: SAS sort was used.
> NOTE: The data set WORK.HAVE has 2000100 observations and 3 variables.
> NOTE: PROCEDURE SORT used (Total process time):
> real time 8:14.01
> cpu time 46.65 seconds
>
>
> 305 data out_g ;
> 306 format date year4. ;
> 307 set have ;
> 308 by id date groupformat ;
> 309 if first.date ;
> 310 run ;
>
> NOTE: There were 2000100 observations read from the data set
> WORK.HAVE.
> NOTE: The data set WORK.OUT_G has 5600 observations and 3 variables.
> NOTE: DATA statement used (Total process time):
> real time 21.40 seconds
> cpu time 5.73 seconds
>
>
> 311
> 312 data _null_ ;
> 313 dcl hash q (hashexp:16, ordered:'y') ;
> 314 q.definekey ('id', 'yr') ;
> 315 q.definedata ('id', 'yr', 'date') ;
> 316 q.definedone () ;
> 317 do until ( eof ) ;
> 318 set have end = eof ;
> 319 yr = year (date) ;
> 320 q.replace () ;
> 321 end ;
> 322 q.output (dataset: 'out_h') ;
> 323 stop ;
> 324 run ;
>
> NOTE: The data set WORK.OUT_H has 5600 observations and 3 variables.
> NOTE: There were 2000100 observations read from the data set
> WORK.HAVE.
> NOTE: DATA statement used (Total process time):
> real time 12.50 seconds
> cpu time 5.76 seconds
>
>
> 325 proc print data=out_h (obs=10);
> 326 run;
>
> NOTE: There were 10 observations read from the data set WORK.OUT_H.
> NOTE: PROCEDURE PRINT used (Total process time):
> real time 0.28 seconds
> cpu time 0.03 seconds
>
>
> 327
> 328 proc sql ;
> 329 create view with_year as
> 330 select id
> 331 , year (date) as yr
> 332 from have
> 333 quit ;
> NOTE: SQL view WORK.WITH_YEAR has been defined.
> 334
> NOTE: PROCEDURE SQL used (Total process time):
> real time 0.34 seconds
> cpu time 0.00 seconds
>
>
> 335 data out_v ;
> 336 set with_year ;
> 337 by id yr ;
> 338 if last.yr ;
> 339 run ;
>
> NOTE: There were 2000100 observations read from the data set
> WORK.HAVE.
> NOTE: There were 2000100 observations read from the data set
> WORK.WITH_YEAR.
> NOTE: The data set WORK.OUT_V has 5600 observations and 2 variables.
> NOTE: DATA statement used (Total process time):
> real time 13.53 seconds
> cpu time 8.43 seconds
>
>
> 340
> 341 proc sql ;
> 342 create table out as
> 343 select t1.*
> 344 from have as t1 inner join out_h as t2
> 345 on t1.ID=t2.ID and t1.date=t2.date
> 346 ;
> NOTE: Table WORK.OUT created, with 5600 rows and 3 columns.
>
> 347 quit ;
> NOTE: PROCEDURE SQL used (Total process time):
> real time 8.00 seconds
> cpu time 3.43 seconds
>
>
>
> -----Original Message-----
> From: owner-sas-l@listserv.uga.edu
> [mailtowner-sas-l@listserv.uga.edu]
> On Behalf Of Pardee, Roy
> Sent: Tuesday, July 12, 2005 2:20 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Views and passes (was RE: Output last record of fantom by
> group)
>
>
> So a dstep view doesn't get 'realized' when it's used, so you
> really do save a pass through the data?
>
> My expectation would be that a SQL view *would* get realized,
> so the savings would be illusory. Anybody know for sure?
> <snip>
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Dale McLerran
> Sent: Tuesday, July 12, 2005 11:09 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Output last record of fantom by group
>
>
> Michael,
>
> A data step view is perfect for this sort of problem. You
> describe year as a fantom (sic) variable. Well, a view
> creates phantom
> (virtual) data.
>
>
> /* Add year to view. Note that view does not execute until it */
> /* is invoked in the next data step. The view is virtual data */
> data addyear / view=addyear;
> set have;
> year = year(date);
> run;
>
>
> /* Set view by permno and year. Year is created as the data */
> /* in have are read by the view but before the data are made */
> /* available to the PDV of the current data step. Thus, we */
> /* can set the data by year. */
> data want;
> set addyear;
> by permno year;
> if last.year;
> run;
>
>
> Dale
>
>
> --- Michael Murff <mjm33@MSM1.BYU.EDU> wrote:
>
> > Hi SAS-L,
> >
> >
> >
> > I want the last record of each id/year as shown below; year

> is a sort
> > of fantom by-group since it only exists within another

> variable, date.
>
> > The dataset "have" is 6 gigs+ so I would prefer to only

> pass through
> > the data once. The dataset "have" is sorted by permno and date.
> > Obviously, I could create a year variable with one pass and

> then, in a
> > subsequent step, use a subsetting if statement to

> conditionally output
> > last.year, but given the size of have this is undesirable.

> Any ideas?
> >
> >
> >
> > TIA,
> >
> >
> >
> > Michael Murff
> >
> > Provo, UT
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > data have;
> >
> > input id $ date date9. type $;
> >
> > format date date9.;
> >
> > cards;
> >
> > a 01jan2001 F
> >
> > a 02jan2001 M
> >
> > a 03jan2001 L
> >
> > a 01jan2002 F
> >
> > a 02jan2002 L
> >
> > b 01jan2001 F
> >
> > b 02jan2001 M
> >
> > b 03jan2001 L
> >
> > b 01jan2002 F
> >
> > b 02jan2002 L
> >
> > b 03jan2003 F
> >
> > b 04jan2003 M
> >
> > b 05jan2003 M
> >
> > b 06jan2003 M
> >
> > b 07jan2003 L
> >
> > ;
> >
> > run;
> >
> >
> >
> >
> >
> > data want;
> >
> > input id $ date date9. type $;
> >
> > format date date9.;
> >
> > cards;
> >
> > a 03jan2001 L
> >
> > a 02jan2002 L
> >
> > b 03jan2001 L
> >
> > b 02jan2002 L
> >
> > b 07jan2003 L
> >
> > ;
> >
> > run;
> >

>
>
> ---------------------------------------
> Dale McLerran
> Fred Hutchinson Cancer Research Center
> mailto: dmclerra@NO_SPAMfhcrc.org
> Ph: (206) 667-2926
> Fax: (206) 667-5977
> ---------------------------------------
>
>
>
> __________________________________________________ __
> Sell on Yahoo! Auctions - no fees. Bid on great items.
> yahoo.com/" target="_blank">http://auctions.yahoo.com/
>
> -----Original Message-----
> From: owner-sas-l@listserv.uga.edu
> [mailtowner-sas-l@listserv.uga.edu]
> On Behalf Of Pardee, Roy
> Sent: Tuesday, July 12, 2005 2:20 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Views and passes (was RE: Output last record of fantom by
> group)
>
>
> So a dstep view doesn't get 'realized' when it's used, so you
> really do save a pass through the data?
>
> My expectation would be that a SQL view *would* get realized,
> so the savings would be illusory. Anybody know for sure?
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Dale McLerran
> Sent: Tuesday, July 12, 2005 11:09 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Output last record of fantom by group
>
>
> Michael,
>
> A data step view is perfect for this sort of problem. You
> describe year as a fantom (sic) variable. Well, a view
> creates phantom
> (virtual) data.
>
>
> /* Add year to view. Note that view does not execute until it */
> /* is invoked in the next data step. The view is virtual data */
> data addyear / view=addyear;
> set have;
> year = year(date);
> run;
>
>
> /* Set view by permno and year. Year is created as the data */
> /* in have are read by the view but before the data are made */
> /* available to the PDV of the current data step. Thus, we */
> /* can set the data by year. */
> data want;
> set addyear;
> by permno year;
> if last.year;
> run;
>
>
> Dale
>
>
> --- Michael Murff <mjm33@MSM1.BYU.EDU> wrote:
>
> > Hi SAS-L,
> >
> >
> >
> > I want the last record of each id/year as shown below; year

> is a sort
> > of fantom by-group since it only exists within another

> variable, date.
>
> > The dataset "have" is 6 gigs+ so I would prefer to only

> pass through
> > the data once. The dataset "have" is sorted by permno and date.
> > Obviously, I could create a year variable with one pass and

> then, in a
> > subsequent step, use a subsetting if statement to

> conditionally output
> > last.year, but given the size of have this is undesirable.

> Any ideas?
> >
> >
> >
> > TIA,
> >
> >
> >
> > Michael Murff
> >
> > Provo, UT
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > data have;
> >
> > input id $ date date9. type $;
> >
> > format date date9.;
> >
> > cards;
> >
> > a 01jan2001 F
> >
> > a 02jan2001 M
> >
> > a 03jan2001 L
> >
> > a 01jan2002 F
> >
> > a 02jan2002 L
> >
> > b 01jan2001 F
> >
> > b 02jan2001 M
> >
> > b 03jan2001 L
> >
> > b 01jan2002 F
> >
> > b 02jan2002 L
> >
> > b 03jan2003 F
> >
> > b 04jan2003 M
> >
> > b 05jan2003 M
> >
> > b 06jan2003 M
> >
> > b 07jan2003 L
> >
> > ;
> >
> > run;
> >
> >
> >
> >
> >
> > data want;
> >
> > input id $ date date9. type $;
> >
> > format date date9.;
> >
> > cards;
> >
> > a 03jan2001 L
> >
> > a 02jan2002 L
> >
> > b 03jan2001 L
> >
> > b 02jan2002 L
> >
> > b 07jan2003 L
> >
> > ;
> >
> > run;
> >

>
>
> ---------------------------------------
> Dale McLerran
> Fred Hutchinson Cancer Research Center
> mailto: dmclerra@NO_SPAMfhcrc.org
> Ph: (206) 667-2926
> Fax: (206) 667-5977
> ---------------------------------------
>
>
>
> __________________________________________________ __
> Sell on Yahoo! Auctions - no fees. Bid on great items.
> yahoo.com/" target="_blank">http://auctions.yahoo.com/
>

Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Views and passes (was RE: Output last record of fantom by Jack Hamilton Newsgroup comp.soft-sys.sas 0 07-15-2005 05:43 AM
Re: Views and passes (was RE: Output last record of fantom by Sigurd Hermansen Newsgroup comp.soft-sys.sas 0 07-13-2005 10:00 PM
Re: Views and passes (was RE: Output last record of fantom by nospam@HOWLES.COM (Howard Schreier Newsgroup comp.soft-sys.sas 0 07-12-2005 11:10 PM
Re: Views and passes (was RE: Output last record of fantom by Dale McLerran Newsgroup comp.soft-sys.sas 0 07-12-2005 10:20 PM
Views and passes (was RE: Output last record of fantom by group) Pardee, Roy Newsgroup comp.soft-sys.sas 0 07-12-2005 06:19 PM



All times are GMT. The time now is 07:03 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.