|
|||
|
On Wed, 6 Jun 2007 15:15:34 -0500, Tom White <tw2@MAIL.COM> wrote:
>Yes, it looks like I may have to split my table into smaller tables and >then run my PROC FREQ and then put them back together. I was already >thinking of that. Thank you.tom Do the partitioning by means of BY processing. Then there will be no need to split and reassemble. > > ----- Original Message ----- > From: "Gerstle, John (CDC/CCID/NCHHSTP) (CTR)" > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION > Date: Wed, 6 Jun 2007 15:47:28 -0400 > > > Missed most of this thread... > > The reason that it won't run on your original data is that you have > too > many observations for your machine to crunch thru. > > Matthew's example split your data into several BY groups, thereby > reducing the amount of data run thru PROC FREQ. > > Given that your data are already sorted (which may not be the case), > then > PROC FREQ DATA=FOO; > BY ID; > TABLE CODE; > RUN; > Should work. > > To create your table below, I'd look into ODS OUTPUT instead of the > OUT= > option. > > i.e.,: > ODS OUTPUT ONEWAYFREQS=CODE_ID; > PROC FREQ DATA=FOO; > BY ID; > TABLE CODE; > RUN; > ODS OUTPUT CLOSE; > > John Gerstle > MS Applied Neuroscience, MS Applied Statistics > Biostatistician > Northrop Grumman > CDC Information Technological Support Contract (CITS) > NCHSTP \DHAP \HICSB \Research and Dissemination Team > Centers for Disease Control and Prevention > yzg9 at cdc dot gov > > >> -----Original Message----- > >> From: owner-sas-l@listserv.uga.edu > [mailto wner-sas-l@listserv.uga.edu]> >> On Behalf Of Tom White > >> Sent: Wednesday, June 06, 2007 3:35 PM > >> To: Muthia Kachirayan > >> Cc: sas-l@listserv.uga.edu > >> Subject: Re: Re: PROC FREQ--DATA STEP--MODELING QUESTION > >> > >> Hi Muthia, > >> > >> I know Matthew's solution seems the simplest. > >> > >> But no one has yet responded: > >> > >> How can I get this PROC FREQ > >> > >> proc FREQ data = foo; > >> tables ID * CODE/ noprint outpct out=TEST1; > >> run; > >> > >> > >> to run on my real data set of 10 million records? > >> > >> The above code and Matthew's code do different things because his > code > >> counts ONLY inside the ID group, > >> whereas mine does the cross tabulation ID * CODE. > >> > >> Both PROC FRECs (Matthew's and mine) are important for my problem. > Except > >> that mine does not run. > >> > >> The only way I suspect for my PROC FREQ to run is to convert it > into a > >> data step. But how? > >> > >> Thank you. > >> > >> tom > >> > >> > >> ----- Original Message ----- > >> From: "Muthia Kachirayan" > >> To: SAS-L@LISTSERV.UGA.EDU > >> Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION > >> Date: Wed, 6 Jun 2007 15:13:30 -0400 > >> > >> > >> Tom, > >> > >> Use of array might be little difficult for you. The solution given > by > >> Matthew is far simpler. In the new output using the code > >> > >> proc sort data=foo; by ID; run; > >> proc freq data=foo; > >> tables CODE/ noprint out=TEST2; > >> by ID; > >> run; > >> > >> your PCT_Row is the PERCENT in the output. To get your Percent, > you > need > >> to > >> add COUNT for all IDs having non-blank CODE. Let it be K. Then > divide > each > >> COUNT by K and multiply by 100. Your PCT_Col is 100 always. > >> > >> Regards, > >> > >> Muthia Kachirayan > >> > >> > >> On 6/6/07, Tom White wrote: > >>> > >>> Yes, it looks like I may have to do some kind of merging. > >>> > >>> Here is what I don't understand with the PROC FREQ. > >>> > >>> Please use the FOO data set as an example. > >>> > >>> > >>> When I run my version of PROC FREQ (which does not run on my real > data > >>> set, WHY?????), > >>> > >>> proc FREQ data = foo; > >>> tables ID * CODE/ noprint outpct out=TEST1; > >>> run; > >>> > >>> > >>> and when I run your version of PROC FREQ (which does run on my > real > data > >>> set) > >>> > >>> proc sort data=foo; by ID; run; > >>> proc freq data=foo; > >>> tables CODE/ noprint out=TEST2; > >>> by ID; > >>> run; > >>> > >>> (1) > >>> I don't get the same results for the Percent of Total Frequency > >> (PERCENT). > >>> > >>> Clearly, then, the two procedures divide by different numbers. > >>> > >>> (2) > >>> Also, in my version of PROC FREQ, how is the Percent of Row > Frequency > >>> (PCT_ROW) calculated? > >>> > >>> (3) > >>> Is there a way to run my version of PROC FREQ on my real data set > >> without > >>> getting an error message? > >>> Maybe some kind of a data step instead? > >>> > >>> I don't know--is this a difficult problem fro SAS-L gurus? > >>> > >>> Thank you. > >>> > >>> tom > >>> > >>> > >>> ----- Original Message ----- > >>> From: "Zack, Matthew M. (CDC/CCHP/NCCDPHP)" > >>> To: "Tom White" > >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION > >>> Date: Wed, 6 Jun 2007 14:10:42 -0400 > >>> > >>> > >>> Since PROC FREQ will write the variables, ID and CODE, to an > output > SAS > >>> data set, you can > >>> merge this summarized data set back with your original data by > these > two > >>> variables. > >>> > >>> Matthew Zack > >>> > >>> -----Original Message----- > >>> From: Tom White [mailto:tw2@mail.com] > >>> Sent: Wednesday, June 06, 2007 2:06 PM > >>> To: Zack, Matthew M. (CDC/CCHP/NCCDPHP) > >>> Cc: sas-l@listserv.uga.edu > >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION > >>> > >>> Thank you for your suggestion Matthew. > >>> > >>> I will try your attached recommendation. > >>> > >>> But, I don't think I can use PROC FREQ because this procedure > rolls > up > >>> my records by ID. > >>> > >>> Think of the FOO data set as you see it. Then imagine other > columns > >>> beyond the ones I show in FOO. > >>> > >>> Suppose a few of these other colunns represent SALARY, AGE, > WEIGHT, > >>> EDUCATION_LEVEL, etc. > >>> > >>> Now, by running PROC FREQ as shown in your attachmnet, it will > roll > up > >>> by observation by ID. > >>> > >>> In the process, what will happen to the other fields? > >>> > >>> Let's say, as an example, that in my real data set I have 153 obs > of > >>> ID=1271. Suppose that 32 of these obs have missing CODE, and the > >>> remaining of the non-missing obs (=153-32 obs) take on various > values of > >>> CODE as shown in FOO. > >>> > >>> Now, by running PROC FREQ, what will happen to the other fields > lke > >>> SALARY, etc. I certainly don't want to roll them up the way PROC > FREQ > >>> does, yet, at the same time, I do want to have the counts and > >>> percentages for the CODE field (by ID) as PROC FREQ shows. > >>> > >>> That's my problem. > >>> > >>> Maybe first I run PROC FREQ, then I do some kind of a merge > between > the > >>> table the PROC FREQ produces and my original table. > >>> > >>> Maybe that way I will preserve my original data in its present > form, > yet > >>> be able to have the results of PROC FREQ on CODE by ID in there > as > well. > >>> But yet I do want to create counts and percentages and so on for > my > CODE > >>> variable. This is a very important predictor variable and as > such, > we > >>> want to create more predictive variables out of it like the ones > >>> produced by PROC FREQ. > >>> > >>> Do you see my problem now? > >>> > >>> Any suggestions? > >>> > >>> Thank you. > >>> > >>> tom > >>> > >>> > >>> ----- Original Message ----- > >>> From: "Zack, Matthew M. (CDC/CCHP/NCCDPHP)" > >>> To: "Tom White" > >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION > >>> Date: Wed, 6 Jun 2007 13:26:24 -0400 > >>> > >>> > >>> To get part of the first table, sort by ID, and tabulate the > >>> CODEs > >>> within each ID (cf., attachment). > >>> The output data set contains the variables--ID, CODE, COUNT, and > >>> PERCENT--where PERCENT corresponds > >>> to your variable, PCT_ROW. > >>> > >>> In your table 1, I don't see any worth for the variable, > >>> PCT_COL, which > >>> always has the value, 100. > >>> In your table 2, the value of your variable, PERCENT, equals the > >>> percentage of the total number > >>> of observations across all IDs and non-missing codes > >>> [=(1/16)*100% = > >>> 6.25%], which again has little > >>> worth, because it too always has the value 6.25. > >>> > >>> Matthew Zack > >>> > >>> -----Original Message----- > >>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > >>> Behalf Of Tom > >>> White > >>> Sent: Wednesday, June 06, 2007 11:13 AM > >>> To: SAS-L@LISTSERV.UGA.EDU > >>> Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION > >>> > >>> Well, it looks like I can'r even get PROC FREQ to create the > >>> firts > >>> table, as the error message below indicates. The second table is > >>> needed > >>> because that's how my real data set looks like. When I apply > >>> PROC FREQ > >>> to the second table (well, if anyone helps me with PROC FREQ), > >>> then I > >>> get the firts table. PROC FREQ rolls up the rows in the second > >>> table and > >>> turns them into the raws in the first table. So, for my > >>> purposes, I need > >>> to create additianl predictive fields using counts and > >>> percentages for > >>> the CODE variable. This CODE variable is avery important > >>> variable from a > >>> business perspective, so we thought that we should use it to > >>> generate > >>> more predictive fields out of it like counts and percentages. I > >>> thought > >>> that if I could even genereate table (1) then maybe it would be > >>> easier > >>> to genearte table (2). Any help, anyone? Thank you. tom > >>> > >>> ----- Original Message ----- > >>> From: "Zack, Matthew M. (CDC/CCHP/NCCDPHP)" > >>> To: "Tom White" > >>> Subject: RE: PROC FREQ--DATA STEP--MODELING QUESTION > >>> Date: Wed, 6 Jun 2007 07:57:17 -0400 > >>> > >>> > >>> Why do you need your second table to run PROC LOGISTIC? > >>> > >>> Matthew Zack > >>> > >>> -----Original Message----- > >>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > >>> Behalf Of > >>> Tom > >>> White > >>> Sent: Tuesday, June 05, 2007 6:27 PM > >>> To: SAS-L@LISTSERV.UGA.EDU > >>> Subject: PROC FREQ--DATA STEP--MODELING QUESTION > >>> > >>> Hello SAS-L, > >>> > >>> The data below are already sorted (or can be sorted if need be) > >>> by ID > >>> (first) and CODE (second). > >>> > >>> The data set of interest contains about 10 mil records and about > >>> 100 > >>> fields. > >>> > >>> In this example, I show 20 obs and two fields of interest. > >>> > >>> > >>> data foo; > >>> input ID $ CODE $; > >>> cards; > >>> 1271 . > >>> 1271 201 > >>> 1435 . > >>> 1435 842 > >>> 1435 842 > >>> 1435 307 > >>> 1435 307 > >>> 1435 307 > >>> 1435 309 > >>> 1435 . > >>> 1434 . > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> 8393 070 > >>> ; > >>> run; > >>> > >>> > >>> I would like to produce two datasets like: > >>> > >>> (1) > >>> > >>> ID CODE COUNT PERCENT PCT_ROW PCT_COL > >>> 1271 1 > >>> 1271 201 1 6.25 100 100 > >>> 1435 3 > >>> 1435 842 2 12.5 33.33 100 > >>> 1435 307 3 18.75 50 100 > >>> 1435 309 1 6.25 16.67 100 > >>> 8393 070 9 56.25 100 100 > >>> > >>> This data set is easily created by using; > >>> > >>> proc freq data=foo; > >>> tables ID * CODE/outpct out=stats; > >>> run; > >>> > >>> > >>> However, when I run this PROC FREQ on the entire dataset of 10 > >>> mil > >>> obs, > >>> I get a message error > >>> > >>> > >>> ERROR: The requested table is too large to process. > >>> NOTE: The SAS System stopped processing this step because of > >>> errors. > >>> NOTE: There were 10154176 observations read from the data set > >>> WORK.FOO. > >>> WARNING: The data set WORK.STATS may be incomplete. When this > >>> step > >>> was > >>> stopped there were 0 > >>> observations and 6 variables. > >>> WARNING: Data set WORK.STATS was not replaced because this step > >>> was > >>> stopped. > >>> NOTE: PROCEDURE FREQ used: > >>> real time 54.70 seconds > >>> cpu time 53.28 seconds > >>> > >>> > >>> > >>> (2) > >>> > >>> The other table I would like to get is > >>> > >>> ID CODE COUNT PERCENT PCT_ROW PCT_COL > >>> 1271 > >>> 1271 201 1 6.25 100 100 > >>> 1435 1 > >>> 1435 842 1 6.25 16.665 50 > >>> 1435 842 1 6.25 16.665 50 > >>> 1435 307 1 6.25 16.667 33.33 > >>> 1435 307 1 6.25 16.667 33.33 > >>> 1435 307 1 6.25 16.667 33.33 > >>> 1435 309 1 6.25 16.67 100 > >>> 1435 1 > >>> 1434 1 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> 8393 070 1 6.25 11.11 11.11 > >>> > >>> This table is simply the same one as table (1) except that it is > >>> not > >>> rolled-up like table (1) is. > >>> > >>> For example, in (1) I show, > >>> > >>> ID CODE COUNT PERCENT PCT_ROW PCT_COL > >>> 1435 3 > >>> 1435 842 2 12.5 33.33 100 > >>> 1435 307 3 18.75 50 100 > >>> 1435 309 1 6.25 16.67 100 > >>> > >>> That's 9 instances of ID 1435 as shown in tbale (2) above--not > >>> rolled-up. > >>> > >>> So I am thinking, since I will need table (2) at some point for > >>> modeling, I need to keep (2) in it's original form and not like > >>> in > >>> form > >>> (1) coming out of PROC FREQ. yet, I still need to create the new > >>> modeling variables COUNT, PERCENT, PCT_ROW, and maybe PCT_COL as > >>> shown > >>> in (1) and populate these corect values in the form of table > >>> (2). > >>> > >>> These new variables in (2) will become inputs to my logistic > >>> model I > >>> am > >>> trying to build. > >>> > >>> So, then, since in (1) PERCENT=12.5 for ID=1435 having COUNT=2, > >>> then, > >>> I > >>> am thinking, maybe wrongly, that if I divide 12.5 by 2, I should > >>> put > >>> 6.25 in above table (2). > >>> > >>> And so on with all the rest of the numbers, I just divide > >>> numbers in > >>> (1) by however many counts (COUNT) I have, and hopefully I get > >>> something > >>> like table (2). > >>> > >>> Please give me some guidance as to how to create table (1) and > >>> then, > >>> for > >>> modeling purposes, how can I keep the original data intact, yet > >>> create > >>> new modeling fields like counts and percentages as shown above > >>> in > >>> (2). > >>> > >>> Thank you. > >>> > >>> tom > >>> > >>> > >>> > >>> > >>> = > >>> Pedometers as Low as $1 - Free Shipping > >>> Huge Selection of Quality Brands Like, Yamax, Sportline, > >>> Freestyle, > >>> and > >>> More. Customer Logos, Free Shipping. Fast Delivery. > >>> > >>> > >>> > http://a8-asy.a8ww.net/a8-ads/adftrc...f8bac8e902c6af > >>> baf95c7b7930 > >>> > >>> -- > >>> Get a free http://www.mail.com account & e-mail address today! > >>> Choose from over 100 personalized domains. > >>> << freqprb2.sas >> > >>> << freqprb2.lst >> > >>> > >>> > >>> > >>> -- > >>> Get a free mail.com account > >>> > erm=070601&utm_content=textlink&utm_campaign=100_p ersonalized_domains> > >>> & e-mail address today! > >>> Choose from over 100 personalized domains. > >>> > >>> > >>> -- > >>> Get a free http://www.mail.com account & e-mail address today! > >>> Choose from over 100 personalized domains. > >>> > >> > >> > >> -- > >> Get a free http://www.mail.com account & e-mail address today! > >> Choose from over 100 personalized domains. > >-- >Get a free http://www.mail.com account & e-mail address today! >Choose from over 100 personalized domains. |
|
|
||||
|
||||
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Re: Question about efficient data extraction | Muthia Kachirayan | Newsgroup comp.soft-sys.sas | 0 | 06-05-2008 12:13 AM |
| Re: Why use a view? | data _null_, | Newsgroup comp.soft-sys.sas | 0 | 02-21-2008 02:30 PM |
| Re: Automatically outputing the results of a PROC FREQ | Long, Stuart | Newsgroup comp.soft-sys.sas | 0 | 11-21-2007 05:36 PM |
| Re: FW: PROC FREQ--DATA STEP--MODELING QUESTION: PROC FREQ memory | Tom White | Newsgroup comp.soft-sys.sas | 0 | 06-07-2007 05:14 PM |
| Re: Data step questions | Ian Whitlock | Newsgroup comp.soft-sys.sas | 1 | 09-09-2006 10:37 AM |