|
|||
|
You've posted two addenda without quoting the earlier messages, which makes
things fragmented and hard to follow. I suggest re-stating the problem. Also, why not provide a starting point by attempting to code the solution, in psuedocode if not in SAS? On Wed, 4 May 2005 16:30:28 -0700, Yiyu <shenyiyu@GMAIL.COM> wrote: >Hi, here is a complicated problem in selecting observations: >The data set is like this: > >Date Stock Signal Porfolio >950501 A 0 0 >950501 B 1 1 >950501 C 0 0 >950501 D 1 1 >950501 E 0 0 > >950502 A 1 1 >950502 B 1 1 >950502 C 1 0 >950502 D 1 1 >950502 E 0 0 > >950503 A 1 1 >950503 B 1 1 >950503 C 1 0 >950503 D 1 1 >950503 E 1 0 > >950504 A 0 0 >950504 B 1 1 >950504 C 0 0 >950504 D 0 0 >950504 E 1 1 >................................... >................................... > >I want to build a portfolio which is essentially composed of all the >Stocks with signal 1, but with a condition: the number of stocks in >each date can't exceed a certain maximum, for example, 3, and more >stocks with singal 1 can't be added to the portfolio until there's some >vacant slots in the portfolio. So you can see at 950501, the portfolio >has B and D in it, but at 950502, even though both A and C also become >1, I can't put both of them into portfolio, instead I randomly select >one stock into portfolio, suppose A is randomly selected. Now the >portfolio is full, I don't add anthing until there's vacant in >portfolio again. So at 950503 even though E also has a singal of 1 it >cannot be added into the portfolio. At 950504 A and D are dropped out >of the potfolio due to the singal 0, now since E still has a singal 1 >it's added into the portfolio. > >Could you please tell me how to implement this algorithm in SAS? Thank >you very much! |
|
|
||||
|
||||
|
|
|
|||
|
"Howard Schreier <hs AT dc-sug DOT org>" wrote:
> You've posted two addenda without quoting the earlier messages, which > makes things fragmented and hard to follow. > > I suggest re-stating the problem. Also, why not provide a starting > point by attempting to code the solution, in psuedocode if not in SAS? > > On Wed, 4 May 2005 16:30:28 -0700, Yiyu <shenyiyu@GMAIL.COM> wrote: > >> Hi, here is a complicated problem in selecting observations: >> The data set is like this: >> >> Date Stock Signal Porfolio >> 950501 A 0 0 >> 950501 B 1 1 >> 950501 C 0 0 >> 950501 D 1 1 >> 950501 E 0 0 >> >> 950502 A 1 1 >> 950502 B 1 1 >> 950502 C 1 0 >> 950502 D 1 1 >> 950502 E 0 0 >> >> 950503 A 1 1 >> 950503 B 1 1 >> 950503 C 1 0 >> 950503 D 1 1 >> 950503 E 1 0 >> >> 950504 A 0 0 >> 950504 B 1 1 >> 950504 C 0 0 >> 950504 D 0 0 >> 950504 E 1 1 >> ................................... >> ................................... >> >> I want to build a portfolio which is essentially composed of all the >> Stocks with signal 1, but with a condition: the number of stocks in >> each date can't exceed a certain maximum, for example, 3, and more >> stocks with singal 1 can't be added to the portfolio until there's >> some vacant slots in the portfolio. So you can see at 950501, the >> portfolio has B and D in it, but at 950502, even though both A and C >> also become 1, I can't put both of them into portfolio, instead I >> randomly select one stock into portfolio, suppose A is randomly >> selected. Now the portfolio is full, I don't add anthing until >> there's vacant in portfolio again. So at 950503 even though E also >> has a singal of 1 it cannot be added into the portfolio. At 950504 A >> and D are dropped out of the potfolio due to the singal 0, now since >> E still has a singal 1 it's added into the portfolio. >> >> Could you please tell me how to implement this algorithm in SAS? Based on the criteria quoted by Howard I see this as a non-trivial task, not amenable to a single pass solution in SQL. The sample below, based on a generic interpretaion of the problem attacks it using a slew of atypical constructs. (Slew = 2) One. A double DOW loop Two. Dual hashes The solution amends the requirement by storing in the pool variable the 'age' of the item in the pool as it persists over groups. A more fair algorithm might consider a time series of the signal as the criteria for pool worthiness. At present, only the most current signal is the only criteria for worthiness. (But some may stipulate that too heavy a reliance on too long a time series can create unfair cliques of 'old boy' networks -- also, whilst the aging active pool members may be gruntled, the disgruntled items are those that missed the boat early on, and are exhibiting equally long strings of signals as those in the boat) ----------------------------- /* * Problem statement: * 0. A point has three attributes; itemIdentifier, signal and groupIdentifier * 1. Points with identical groupIdentifiers form a Group * 2. Groups are ordered by groupIdentifier * 3. A point is eligible for inclusion in an 'active pool' if its signal is 1 * 3b. A point is removed from the active pool if its signal is 0 * 4. The active pool persists across group iteration * 4b. The active pool has a limit * 5. Items in the active pool have preference over items newly eligible (seniority) * 6. When there are more items eligible than slots available in the active pool * then the eligible items enter the active pool by random selection via k/n */ * this sample uses hash to perform lookup; * if data is sensibly sized integers, * then arrays and direct addressing could be used and would be faster; %let seed = 0DEAFx; %let nGroups = 1000; %let nItems = 25; %let POOLSIZE = 12; data groupedItems; do group = 1 to &nGroups; do item = 1 to &nitems; signal = rannor(&seed) > 0; output; end; end; run; data itemsWithPoolIndicator ; if _n_ = 1 then do; declare hash AP(); * Active pool; AP.defineKey ('item'); AP.defineData('age'); AP.defineDone(); declare hash C(); * Pool Candidates; C.defineKey ('item'); C.defineData ('item'); C.defineDone(); declare hiter hi; end; * fill the groups candidate pool ; do until (last.group); set groupedItems (where=(not missing(item))); by group; * determine state; _isActive = (AP.find() = 0); * add items into candidate pool or remove from active pool; if ^_isActive and signal then C.replace(); else if _isActive and ^signal then AP.remove(); else if _isActive and signal then AP.replace(key:item,data:age+1); end; * move random candidates into available pool slots; _N = C.num_items; _K = &POOLSIZE - AP.num_items; if _K > 0 then do; age = 1; hi = _new_ hiter('C'); do while (hi.next()=0 and _N > 0); if ranuni (&SEED) < _K / _N then do; _K = _K - 1; AP.add(); end; _N = _N - 1; end; hi.delete(); end; * mark items as in active pool; do until (last.group); set groupedItems (where=(not missing(item))); by group; if AP.find()=0 then pool=age;else pool=0; OUTPUT; end; * clear the candidate hash; *--- this loopage will be C.CLEAR() in v9.2; hi = _new_ hiter('C'); _prior = .; do while (hi.next()=0); _rc = C.remove (key:_prior); * assignment prevents error message on first iteration; _prior=item; end; _rc = C.remove (key:_prior); hi.delete(); *--- this loopage will be C.CLEAR() in v9.2; drop _: ; run; ----------------------------- Richard A. DeVenezia -- Learn how to customize SAS Explorer http://www.devenezia.com/downloads/sas/actions/ |
|
|||
|
Yiyu wrote:
> Thanks Richard. > > But unfortunately the object dot syntax is only available in V 9.0, > while my SAS version is 8.2 Is there any other way to do it? Yiyu: You can preprocess the data and assign your own numeric itemId to each item. When the item identifier is a reasonably sized numeric, then arrays can be used in place of hashing. Friday puzzler: This sample is correct except for one if statement, where the logic is inverted. If you can fix the inversion you will have your answer *and* you will understand the program quite well ![]() --------------------------- /* * 1. The are groups, sequentially ordered * 2. A group contains items * 3. In each group an item is eligible for inclusion in an 'active pool' * if its signal attribute is 1 * 4. The active pool has a limit * 5. Items active in previous group have preference over items first * establishing eligibility (seniority if you will) * 6. When the number of items eligible for entry into the active pool * would cause the pool to exceed its limits, the items that enter * the pool are randomly selected via k/n */ * this sample uses hash to perform lookup; * if data is sensibly sized integers, * then arrays and direct addressing could be used and would be faster; %let seed = 0DEAFx; %let nGroups = 1000; %let nItems = 1000; %let POOLSIZE = 500; data groups; do group = 1 to &nGroups; output; end; run; proc sql; create table items (item char(6)) ; create unique index item on items(item) ; quit; %let ranChar = byte(rank('A')+26*ranuni(&SEED)); data items(index=(item/unique)); do _n_ = 1 to &nItems; item = &ranChar. || &ranChar. || &ranChar. || &ranChar. || &ranChar. || &ranChar. ; output; end; stop; run; data groupedItems; set groups; do _i = 1 to _n; set items nobs=_n point=_i; signal = rannor(&SEED) > 0; output; end; run; * step 1. - create a mapping from a item to an itemId presuming there is no prior knowledge of item values. If you have an apriori table of allowed item values then work of that instead of the raw data ; proc sort data=groupedItems(keep=item) nodupkey out=myItems; by item; run; * use this if you have an apriori table; * data myItems/view=myItems; * set MASTER.ITEMS; * run; data myItemsFormat; set myItems(rename=item=start); retain fmtname '@TOID'; label+1; run; proc format cntlin=myItemsFormat; run; * step 2. - measure the data (or use your own apriori set limits); proc sql noprint; select count(*) into :nItems from myItems; quit; * step 3. - determine active pool at each group state * (using arrays) * presumes an item appears only once in a group *; data itemsWithPoolIndicator ; retain poolsize 0; array AP[&nItems] _temporary_; array CP[&nItems] _temporary_; do _i = 1 to dim(CP); CP[_i] = 0; end; * fill the candidate pool; candsize = 0; do until (last.group); set groupedItems (where=(not missing(item))); by group; itemId = input(item,TOID.); * determine state; _isActive = AP[itemId]; * add items into candidate pool or remove from active pool; if ^_isActive and signal then do; CP[itemId]=1; candsize+ 1; end; else if _isActive and ^signal then do; AP[itemId]=0; poolsize+-1; end; else if _isActive and signal then do; AP[itemId]+1; end; * this is where a dup item in a group would hurt the age accuracy; end; * move random candidates into available pool slots; _N = candsize; _K = &POOLSIZE - poolsize; if _K > 0 then do; do itemId = 1 to dim (CP) while (_N > 0); if CP[itemId] then continue; if ranuni (&SEED) < _K / _N then do; _K = _K - 1; AP[itemId]=1; poolsize + 1; end; _N = _N - 1; end; end; * mark item as in active pool; do until (last.group); set groupedItems (where=(not missing(item))); by group; itemId = input(item,TOID.); pool = AP[itemId]; OUTPUT; end; drop _: itemId ; run; --------------------------- The 'level-up' in this type of problem is to use bit coding to track an items membership in a set. Such tracking is only feasible if the pool is to have no age memory. If you want to track age bitwise, then each item requires N bits where the max age is known to be <= 2**N -- Richard A. DeVenezia -- Learn how to customize SAS Explorer http://www.devenezia.com/downloads/sas/actions/ |
|
|||
|
Richard, I believe the inversion is here:
' if CP[itemId] then continue; ' Instead it should be if ^CP[itemId] then continue; Correct? Also what's the meaning of 'poolsize+-'? why both + and -? Using the data created in your sample the code performs as expected. But when I run it on my real data set, which has 3 million obs, the behavior is quite strange. More specifically, 'group' here is date, from 19600101 to 20031231, from around 1960 to 1963 the result is as expected, but after 1963 the eligible obs selected into pool is fewer and few, and after 1990 there 's almost only 1 obs in pool for each day. This is obviously not correct since for each date after 1990 there are at least 300-500 obs with hold = 1 and it's supposed for many days pool = 100. I attach the code here: libname yiyu '/sastemp3/yiyu'; * 1. The are groups, sequentially ordered * 2. A group contains items * 3. In each group an item is eligible for inclusion in an 'active pool' * if its signal attribute is 1 * 4. The active pool has a limit * 5. Items active in previous group have preference over items first * establishing eligibility (seniority if you will) * 6. When the number of items eligible for entry into the active pool * would cause the pool to exceed its limits, the items that enter * the pool are randomly selected via k/n %let seed = 0DEAFx; %let nItems = 20000; %let POOLSIZE = 100; * there are about 16000 stocks, I want to build a portfolio with max number of 100; * permno is stock's id; * variable Hold indicates the stock should be portfolio if no restricitons ; data yiyu.test99; set yiyu.max_3; keep date permno hold ini; run; proc sort data=yiyu.test99(keep=permno) nodupkey out=yiyu.myItems; by permno; run; data yiyu.myItems; set yiyu.myItems; itemID = _N_; * create key for each stock; run; proc sort data=yiyu.test99; by permno; run; proc sort data=yiyu.myItems; by permno; run; data yiyu.test99; merge yiyu.test99 yiyu.myItems; by permno; run; * step 3. - determine active pool at each group state * (using arrays) * presumes an item appears only once in a group *; proc sort data=yiyu.test99; by date permno; run; proc sort data= yiyu.test99; by date; run; data yiyu.PoolIndicator ; retain poolsize 0; array AP[&nItems] _temporary_; * Active Pool; array CP[&nItems] _temporary_; * Candidate Pool; do _i = 1 to dim(CP); CP[_i] = 0; end; * fill the candidate pool; candsize = 0; do until (last.date); set yiyu.test99 ; by date; * determine state; _isActive = AP[itemId]; * add items into candidate pool or remove from active pool; if ^_isActive and hold then do; CP[itemId]=1; candsize+ 1; end; else if _isActive and ^hold then do; AP[itemId]=0; poolsize+-1; end; else if _isActive and hold then do; AP[itemId]+1; end; * this is where a * dup item in a group would hurt the age accuracy; end; * move random candidates into available pool slots; _N = candsize; _K = &POOLSIZE - poolsize; if _K > 0 then do; do itemId = 1 to dim (CP) while (_N > 0); if ^CP[itemId] then continue; if ranuni(12345) < _K / _N then do; _K = _K - 1; AP[itemId]=1; poolsize + 1; end; _N = _N - 1; end; end; * mark item as in active pool; do until (last.date); set yiyu.test99 ; by date; pool = AP[itemId]; c_pool = CP[itemId]; OUTPUT; end; run; Thanks a lot! Yiyu |
|
|||
|
For code of core data step 'yiyu.PoolIndicator' I changed nothing but
the name of variables, and used merge instead of Proc Format to create indices, it's really weird why the code performs correctly on simulated data sets but not on my own data set. |
|
|||
|
Yiyu wrote:
> Instead it should be if ^CP[itemId] then continue; > Correct? Yes, correct. > For code of core data step 'yiyu.PoolIndicator' I changed nothing but > the name of variables, and used merge instead of Proc Format to create > indices, it's really weird why the code performs correctly on > simulated data sets but not on my own data set. sort,data,sort,sort,merge,sort,sort is extraneous. A sort,data,sort would work equally well, and you can check for dup permno within date. proc sort data=yiyu.test99; by permno date; run; data yiyu.test99; set data yiyu.test99 end=end; by permno date; if first.permno then itemId + 1; if first.date and not last.date then put permno= ' appears more than once in ' date= ; if end then call symput ('nItems', put(itemId,12.-L)); run; proc sort data=yiyu.test99; by date permno; run; Discussions of k/n in SAS go back at least ten years. Look for Dr. John Whittigon threads. The Sep 15-24, 1999 thread was illuminating. As for the problem, it is likely the slight miscoding of the k/n algorithm is impacting you. The test ranuni() < k/n should be a <=, as in if ranuni(&seed) <= _K/_N then do; ..... end; Let us know if this seems to fix the problem. -- Richard A. DeVenezia |
|
|||
|
Hi, Richard, I figure out where the problem is and it's not related
with k/n, the problem is actually with the way of table-lookup. The difference between the artificial sample in the code sample and my real data is that in the artificial sample each group has every item, so when we assign the key itemID to each item and look up them using AP[itemID] we are sure we are looking at the status of exact same item. If C is assign key 3 , then ' if ^AP[3] ' is always checking whether C is in the pool. But in my real data each date has different groups of permno, previous permno will often be dropped in later dates, it's actually like this: Date Permno Hold 19900101 10001 1 19900101 10002 0 19900101 10003 0 19900102 10001 0 19900102 10004 0 19900102 10005 1 19900102 10006 1 So following the code _isActive = AP[itemId]; if ^_isActive at 19900102, when ItemID = 3 we want to check if 10005 is active and whether its Hold =1 , but actually we checked whether 10003 is active and hold = 1, since 10003 already dropped out of sample, no wonder we missed one obs which should have been added to pool. And for the whole 19900102 by last.date we checked 4 obs, though we are supposed to check 10001, 10004, 10005, 10006, we acutally checked 10001, 10002, 10003, 10004. Since with date goes further, more earlier permno's are dropped out of sample, no wonder the pool will be smaller and smaller, since we are checking more and more permno's not in the group. So instead of using if ^AP[ItemID] we actually need something like ^AP[permno], i.e. we need to directly check if the stock with the permno is active, or hold =1, etc ..., put it another way, we need to use permno as the key directly. Any way to do this? Thanks! Yiyu |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Re: wierd selection problem | nospam@HOWLES.COM (Howard Schreier | Newsgroup comp.soft-sys.sas | 0 | 05-15-2007 02:30 AM |
| Re: wierd selection problem | Sigurd Hermansen | Newsgroup comp.soft-sys.sas | 0 | 05-14-2007 05:42 PM |
| Re: a sas problem bout regression | David L Cassell | Newsgroup comp.soft-sys.sas | 0 | 04-23-2007 05:49 AM |
| Re: Usefulness of a text editor? | Peter Crawford | Newsgroup comp.soft-sys.sas | 0 | 07-08-2005 03:22 PM |
| a conditional selection problem with cap | Yiyu | Newsgroup comp.soft-sys.sas | 2 | 05-05-2005 12:45 AM |