Go Back   Rhinocerus > Newsgroup > Newsgroup comp.soft-sys.sas

Reply
 
Thread Tools Display Modes
  #1 (permalink)  
Old 06-07-2007, 02:04 AM
nospam@HOWLES.COM (Howard Schreier
Guest
 
Posts: n/a
Default Re: PROC FREQ--DATA STEP--MODELING QUESTION

On Wed, 6 Jun 2007 15:15:34 -0500, Tom White <tw2@MAIL.COM> wrote:

>Yes, it looks like I may have to split my table into smaller tables and
>then run my PROC FREQ and then put them back together. I was already
>thinking of that. Thank you.tom


Do the partitioning by means of BY processing. Then there will be no need to
split and reassemble.

>
> ----- Original Message -----
> From: "Gerstle, John (CDC/CCID/NCHHSTP) (CTR)"
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION
> Date: Wed, 6 Jun 2007 15:47:28 -0400
>
>
> Missed most of this thread...
>
> The reason that it won't run on your original data is that you have
> too
> many observations for your machine to crunch thru.
>
> Matthew's example split your data into several BY groups, thereby
> reducing the amount of data run thru PROC FREQ.
>
> Given that your data are already sorted (which may not be the case),
> then
> PROC FREQ DATA=FOO;
> BY ID;
> TABLE CODE;
> RUN;
> Should work.
>
> To create your table below, I'd look into ODS OUTPUT instead of the
> OUT=
> option.
>
> i.e.,:
> ODS OUTPUT ONEWAYFREQS=CODE_ID;
> PROC FREQ DATA=FOO;
> BY ID;
> TABLE CODE;
> RUN;
> ODS OUTPUT CLOSE;
>
> John Gerstle
> MS Applied Neuroscience, MS Applied Statistics
> Biostatistician
> Northrop Grumman
> CDC Information Technological Support Contract (CITS)
> NCHSTP \DHAP \HICSB \Research and Dissemination Team
> Centers for Disease Control and Prevention
> yzg9 at cdc dot gov
>
> >> -----Original Message-----
> >> From: owner-sas-l@listserv.uga.edu

> [mailtowner-sas-l@listserv.uga.edu]
> >> On Behalf Of Tom White
> >> Sent: Wednesday, June 06, 2007 3:35 PM
> >> To: Muthia Kachirayan
> >> Cc: sas-l@listserv.uga.edu
> >> Subject: Re: Re: PROC FREQ--DATA STEP--MODELING QUESTION
> >>
> >> Hi Muthia,
> >>
> >> I know Matthew's solution seems the simplest.
> >>
> >> But no one has yet responded:
> >>
> >> How can I get this PROC FREQ
> >>
> >> proc FREQ data = foo;
> >> tables ID * CODE/ noprint outpct out=TEST1;
> >> run;
> >>
> >>
> >> to run on my real data set of 10 million records?
> >>
> >> The above code and Matthew's code do different things because his

> code
> >> counts ONLY inside the ID group,
> >> whereas mine does the cross tabulation ID * CODE.
> >>
> >> Both PROC FRECs (Matthew's and mine) are important for my problem.

> Except
> >> that mine does not run.
> >>
> >> The only way I suspect for my PROC FREQ to run is to convert it

> into a
> >> data step. But how?
> >>
> >> Thank you.
> >>
> >> tom
> >>
> >>
> >> ----- Original Message -----
> >> From: "Muthia Kachirayan"
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION
> >> Date: Wed, 6 Jun 2007 15:13:30 -0400
> >>
> >>
> >> Tom,
> >>
> >> Use of array might be little difficult for you. The solution given

> by
> >> Matthew is far simpler. In the new output using the code
> >>
> >> proc sort data=foo; by ID; run;
> >> proc freq data=foo;
> >> tables CODE/ noprint out=TEST2;
> >> by ID;
> >> run;
> >>
> >> your PCT_Row is the PERCENT in the output. To get your Percent,

> you
> need
> >> to
> >> add COUNT for all IDs having non-blank CODE. Let it be K. Then

> divide
> each
> >> COUNT by K and multiply by 100. Your PCT_Col is 100 always.
> >>
> >> Regards,
> >>
> >> Muthia Kachirayan
> >>
> >>
> >> On 6/6/07, Tom White wrote:
> >>>
> >>> Yes, it looks like I may have to do some kind of merging.
> >>>
> >>> Here is what I don't understand with the PROC FREQ.
> >>>
> >>> Please use the FOO data set as an example.
> >>>
> >>>
> >>> When I run my version of PROC FREQ (which does not run on my real

> data
> >>> set, WHY?????),
> >>>
> >>> proc FREQ data = foo;
> >>> tables ID * CODE/ noprint outpct out=TEST1;
> >>> run;
> >>>
> >>>
> >>> and when I run your version of PROC FREQ (which does run on my

> real
> data
> >>> set)
> >>>
> >>> proc sort data=foo; by ID; run;
> >>> proc freq data=foo;
> >>> tables CODE/ noprint out=TEST2;
> >>> by ID;
> >>> run;
> >>>
> >>> (1)
> >>> I don't get the same results for the Percent of Total Frequency
> >> (PERCENT).
> >>>
> >>> Clearly, then, the two procedures divide by different numbers.
> >>>
> >>> (2)
> >>> Also, in my version of PROC FREQ, how is the Percent of Row

> Frequency
> >>> (PCT_ROW) calculated?
> >>>
> >>> (3)
> >>> Is there a way to run my version of PROC FREQ on my real data set
> >> without
> >>> getting an error message?
> >>> Maybe some kind of a data step instead?
> >>>
> >>> I don't know--is this a difficult problem fro SAS-L gurus?
> >>>
> >>> Thank you.
> >>>
> >>> tom
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: "Zack, Matthew M. (CDC/CCHP/NCCDPHP)"
> >>> To: "Tom White"
> >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION
> >>> Date: Wed, 6 Jun 2007 14:10:42 -0400
> >>>
> >>>
> >>> Since PROC FREQ will write the variables, ID and CODE, to an

> output
> SAS
> >>> data set, you can
> >>> merge this summarized data set back with your original data by

> these
> two
> >>> variables.
> >>>
> >>> Matthew Zack
> >>>
> >>> -----Original Message-----
> >>> From: Tom White [mailto:tw2@mail.com]
> >>> Sent: Wednesday, June 06, 2007 2:06 PM
> >>> To: Zack, Matthew M. (CDC/CCHP/NCCDPHP)
> >>> Cc: sas-l@listserv.uga.edu
> >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION
> >>>
> >>> Thank you for your suggestion Matthew.
> >>>
> >>> I will try your attached recommendation.
> >>>
> >>> But, I don't think I can use PROC FREQ because this procedure

> rolls
> up
> >>> my records by ID.
> >>>
> >>> Think of the FOO data set as you see it. Then imagine other

> columns
> >>> beyond the ones I show in FOO.
> >>>
> >>> Suppose a few of these other colunns represent SALARY, AGE,

> WEIGHT,
> >>> EDUCATION_LEVEL, etc.
> >>>
> >>> Now, by running PROC FREQ as shown in your attachmnet, it will

> roll
> up
> >>> by observation by ID.
> >>>
> >>> In the process, what will happen to the other fields?
> >>>
> >>> Let's say, as an example, that in my real data set I have 153 obs

> of
> >>> ID=1271. Suppose that 32 of these obs have missing CODE, and the
> >>> remaining of the non-missing obs (=153-32 obs) take on various

> values of
> >>> CODE as shown in FOO.
> >>>
> >>> Now, by running PROC FREQ, what will happen to the other fields

> lke
> >>> SALARY, etc. I certainly don't want to roll them up the way PROC

> FREQ
> >>> does, yet, at the same time, I do want to have the counts and
> >>> percentages for the CODE field (by ID) as PROC FREQ shows.
> >>>
> >>> That's my problem.
> >>>
> >>> Maybe first I run PROC FREQ, then I do some kind of a merge

> between
> the
> >>> table the PROC FREQ produces and my original table.
> >>>
> >>> Maybe that way I will preserve my original data in its present

> form,
> yet
> >>> be able to have the results of PROC FREQ on CODE by ID in there

> as
> well.
> >>> But yet I do want to create counts and percentages and so on for

> my
> CODE
> >>> variable. This is a very important predictor variable and as

> such,
> we
> >>> want to create more predictive variables out of it like the ones
> >>> produced by PROC FREQ.
> >>>
> >>> Do you see my problem now?
> >>>
> >>> Any suggestions?
> >>>
> >>> Thank you.
> >>>
> >>> tom
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: "Zack, Matthew M. (CDC/CCHP/NCCDPHP)"
> >>> To: "Tom White"
> >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION
> >>> Date: Wed, 6 Jun 2007 13:26:24 -0400
> >>>
> >>>
> >>> To get part of the first table, sort by ID, and tabulate the
> >>> CODEs
> >>> within each ID (cf., attachment).
> >>> The output data set contains the variables--ID, CODE, COUNT, and
> >>> PERCENT--where PERCENT corresponds
> >>> to your variable, PCT_ROW.
> >>>
> >>> In your table 1, I don't see any worth for the variable,
> >>> PCT_COL, which
> >>> always has the value, 100.
> >>> In your table 2, the value of your variable, PERCENT, equals the
> >>> percentage of the total number
> >>> of observations across all IDs and non-missing codes
> >>> [=(1/16)*100% =
> >>> 6.25%], which again has little
> >>> worth, because it too always has the value 6.25.
> >>>
> >>> Matthew Zack
> >>>
> >>> -----Original Message-----
> >>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> >>> Behalf Of Tom
> >>> White
> >>> Sent: Wednesday, June 06, 2007 11:13 AM
> >>> To: SAS-L@LISTSERV.UGA.EDU
> >>> Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION
> >>>
> >>> Well, it looks like I can'r even get PROC FREQ to create the
> >>> firts
> >>> table, as the error message below indicates. The second table is
> >>> needed
> >>> because that's how my real data set looks like. When I apply
> >>> PROC FREQ
> >>> to the second table (well, if anyone helps me with PROC FREQ),
> >>> then I
> >>> get the firts table. PROC FREQ rolls up the rows in the second
> >>> table and
> >>> turns them into the raws in the first table. So, for my
> >>> purposes, I need
> >>> to create additianl predictive fields using counts and
> >>> percentages for
> >>> the CODE variable. This CODE variable is avery important
> >>> variable from a
> >>> business perspective, so we thought that we should use it to
> >>> generate
> >>> more predictive fields out of it like counts and percentages. I
> >>> thought
> >>> that if I could even genereate table (1) then maybe it would be
> >>> easier
> >>> to genearte table (2). Any help, anyone? Thank you. tom
> >>>
> >>> ----- Original Message -----
> >>> From: "Zack, Matthew M. (CDC/CCHP/NCCDPHP)"
> >>> To: "Tom White"
> >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION
> >>> Date: Wed, 6 Jun 2007 07:57:17 -0400
> >>>
> >>>
> >>> Why do you need your second table to run PROC LOGISTIC?
> >>>
> >>> Matthew Zack
> >>>
> >>> -----Original Message-----
> >>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> >>> Behalf Of
> >>> Tom
> >>> White
> >>> Sent: Tuesday, June 05, 2007 6:27 PM
> >>> To: SAS-L@LISTSERV.UGA.EDU
> >>> Subject: PROC FREQ--DATA STEP--MODELING QUESTION
> >>>
> >>> Hello SAS-L,
> >>>
> >>> The data below are already sorted (or can be sorted if need be)
> >>> by ID
> >>> (first) and CODE (second).
> >>>
> >>> The data set of interest contains about 10 mil records and about
> >>> 100
> >>> fields.
> >>>
> >>> In this example, I show 20 obs and two fields of interest.
> >>>
> >>>
> >>> data foo;
> >>> input ID $ CODE $;
> >>> cards;
> >>> 1271 .
> >>> 1271 201
> >>> 1435 .
> >>> 1435 842
> >>> 1435 842
> >>> 1435 307
> >>> 1435 307
> >>> 1435 307
> >>> 1435 309
> >>> 1435 .
> >>> 1434 .
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> 8393 070
> >>> ;
> >>> run;
> >>>
> >>>
> >>> I would like to produce two datasets like:
> >>>
> >>> (1)
> >>>
> >>> ID CODE COUNT PERCENT PCT_ROW PCT_COL
> >>> 1271 1
> >>> 1271 201 1 6.25 100 100
> >>> 1435 3
> >>> 1435 842 2 12.5 33.33 100
> >>> 1435 307 3 18.75 50 100
> >>> 1435 309 1 6.25 16.67 100
> >>> 8393 070 9 56.25 100 100
> >>>
> >>> This data set is easily created by using;
> >>>
> >>> proc freq data=foo;
> >>> tables ID * CODE/outpct out=stats;
> >>> run;
> >>>
> >>>
> >>> However, when I run this PROC FREQ on the entire dataset of 10
> >>> mil
> >>> obs,
> >>> I get a message error
> >>>
> >>>
> >>> ERROR: The requested table is too large to process.
> >>> NOTE: The SAS System stopped processing this step because of
> >>> errors.
> >>> NOTE: There were 10154176 observations read from the data set
> >>> WORK.FOO.
> >>> WARNING: The data set WORK.STATS may be incomplete. When this
> >>> step
> >>> was
> >>> stopped there were 0
> >>> observations and 6 variables.
> >>> WARNING: Data set WORK.STATS was not replaced because this step
> >>> was
> >>> stopped.
> >>> NOTE: PROCEDURE FREQ used:
> >>> real time 54.70 seconds
> >>> cpu time 53.28 seconds
> >>>
> >>>
> >>>
> >>> (2)
> >>>
> >>> The other table I would like to get is
> >>>
> >>> ID CODE COUNT PERCENT PCT_ROW PCT_COL
> >>> 1271
> >>> 1271 201 1 6.25 100 100
> >>> 1435 1
> >>> 1435 842 1 6.25 16.665 50
> >>> 1435 842 1 6.25 16.665 50
> >>> 1435 307 1 6.25 16.667 33.33
> >>> 1435 307 1 6.25 16.667 33.33
> >>> 1435 307 1 6.25 16.667 33.33
> >>> 1435 309 1 6.25 16.67 100
> >>> 1435 1
> >>> 1434 1
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>> 8393 070 1 6.25 11.11 11.11
> >>>
> >>> This table is simply the same one as table (1) except that it is
> >>> not
> >>> rolled-up like table (1) is.
> >>>
> >>> For example, in (1) I show,
> >>>
> >>> ID CODE COUNT PERCENT PCT_ROW PCT_COL
> >>> 1435 3
> >>> 1435 842 2 12.5 33.33 100
> >>> 1435 307 3 18.75 50 100
> >>> 1435 309 1 6.25 16.67 100
> >>>
> >>> That's 9 instances of ID 1435 as shown in tbale (2) above--not
> >>> rolled-up.
> >>>
> >>> So I am thinking, since I will need table (2) at some point for
> >>> modeling, I need to keep (2) in it's original form and not like
> >>> in
> >>> form
> >>> (1) coming out of PROC FREQ. yet, I still need to create the new
> >>> modeling variables COUNT, PERCENT, PCT_ROW, and maybe PCT_COL as
> >>> shown
> >>> in (1) and populate these corect values in the form of table
> >>> (2).
> >>>
> >>> These new variables in (2) will become inputs to my logistic
> >>> model I
> >>> am
> >>> trying to build.
> >>>
> >>> So, then, since in (1) PERCENT=12.5 for ID=1435 having COUNT=2,
> >>> then,
> >>> I
> >>> am thinking, maybe wrongly, that if I divide 12.5 by 2, I should
> >>> put
> >>> 6.25 in above table (2).
> >>>
> >>> And so on with all the rest of the numbers, I just divide
> >>> numbers in
> >>> (1) by however many counts (COUNT) I have, and hopefully I get
> >>> something
> >>> like table (2).
> >>>
> >>> Please give me some guidance as to how to create table (1) and
> >>> then,
> >>> for
> >>> modeling purposes, how can I keep the original data intact, yet
> >>> create
> >>> new modeling fields like counts and percentages as shown above
> >>> in
> >>> (2).
> >>>
> >>> Thank you.
> >>>
> >>> tom
> >>>
> >>>
> >>>
> >>>
> >>> =
> >>> Pedometers as Low as $1 - Free Shipping
> >>> Huge Selection of Quality Brands Like, Yamax, Sportline,
> >>> Freestyle,
> >>> and
> >>> More. Customer Logos, Free Shipping. Fast Delivery.
> >>>
> >>>
> >>>

> http://a8-asy.a8ww.net/a8-ads/adftrc...f8bac8e902c6af
> >>> baf95c7b7930
> >>>
> >>> --
> >>> Get a free http://www.mail.com account & e-mail address today!
> >>> Choose from over 100 personalized domains.
> >>> << freqprb2.sas >>
> >>> << freqprb2.lst >>
> >>>
> >>>
> >>>
> >>> --
> >>> Get a free mail.com account
> >>>

> erm=070601&utm_content=textlink&utm_campaign=100_p ersonalized_domains>
> >>> & e-mail address today!
> >>> Choose from over 100 personalized domains.
> >>>
> >>>
> >>> --
> >>> Get a free http://www.mail.com account & e-mail address today!
> >>> Choose from over 100 personalized domains.
> >>>
> >>
> >>
> >> --
> >> Get a free http://www.mail.com account & e-mail address today!
> >> Choose from over 100 personalized domains.

>
>--
>Get a free http://www.mail.com account & e-mail address today!
>Choose from over 100 personalized domains.

Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Question about efficient data extraction Muthia Kachirayan Newsgroup comp.soft-sys.sas 0 06-05-2008 12:13 AM
Re: Why use a view? data _null_, Newsgroup comp.soft-sys.sas 0 02-21-2008 02:30 PM
Re: Automatically outputing the results of a PROC FREQ Long, Stuart Newsgroup comp.soft-sys.sas 0 11-21-2007 05:36 PM
Re: FW: PROC FREQ--DATA STEP--MODELING QUESTION: PROC FREQ memory Tom White Newsgroup comp.soft-sys.sas 0 06-07-2007 05:14 PM
Re: Data step questions Ian Whitlock Newsgroup comp.soft-sys.sas 1 09-09-2006 10:37 AM



All times are GMT. The time now is 11:04 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.