|
|||
|
On Nov 23, 11:59 pm, John Passaniti <john.passan...@gmail.com> wrote:
> On Nov 22, 11:23 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote: > > > I recently ported my symtab.factor program to Forth (and fixed a bug > > in it). You couldn't figure out symtab last time I posted it, but > > maybe you'll get it this time. Here is the Factor version again: > > >http://www.rosycrew.com/symtab.factor > > I would prefer that you revert the code back to what I casually > reviewed and *then* tell me precisely what I got wrong. Note the word > "precisely" Here is a Factor program: www.rosycrew.com/symtab.factor Here is a Forth program: www.rosycrew.com/symtab.4th www.rosycrew.com/novice.4th The novice.4th program contains some support functions needed by symtab.4th. The novice.4th program isn't complete; I am adding new functions to it and will later make it available to novices to help them get started. What we have here is just a step in that direction. You had said that the nodes in the tree are sorted according to their frequency of access. This isn't true; they are sorted alphabetically, the same as in any binary tree. The symtab tree is *structured* according to frequency of access. The most frequently accessed node is the root, the next-most frequently accessed nodes are its children, and so on down to the leaves, which are the least frequently accessed nodes. The idea is to provide fast access to words such as DUP that are used a lot, and slow access to words such as QUIT and BYE that are used less. The result should be an overall improvement in speed. Hash tables are the most popular data structure used for symbol tables nowadays. These also have faster access for some nodes, and slower access for other nodes (the nodes that have to be resolved through linked lists). The problem is that *which* nodes are fast or slow is pretty much a matter of chance, although the earlier-defined nodes tend to be faster than the more recent ones. Bad luck might give you slow access on DUP and fast access on BYE. Hash tables are popular these days because computers have a lot of memory. The traditional complaint against hash tables was that they use too much memory, and therefore are a bad choice for 16-bit computers. The memory problem is still an issue with 32-bit computers though. Consider the case in which each wordlist has its own hast table. Also consider the case in which the programmer uses associative arrays (as done in Lua and AWK and various other high-level languages). Every hash table is going to be gigantic, even if it only has a few dozen elements in it. The compiler doesn't know how many elements are going to be inserted into the hash table, and so it must make each hash table large enough that it won't overflow in the worst- case scenario that this will be the one that gets thousands of elements. By comparison, my symtab data structure doesn't have any up- front overhead. It does have more per-node overhead though, as each node has a left and right pointer field, and a count field. In symtab, the size of the data structure is linearly related to the number of elements. The same amount of memory is going to be used whether you have one symtab with X elements in it, or N symtabs with an average of X/N elements in each. No, I'm not going to reintroduce a bug that I have already fixed. Why would I want to do that? It was a pretty obscure bug that only came up when a node was inserted into the tree with the same key as an existing node. The bug was that I smudged the new node rather than the old node. In a Forth dictionary, the old node is supposed to be smudged and the new node is supposed to be inserted into the dictionary, typically with a "redefinition" warning. Fixing this bug didn't change the symtab algorithm. I stand by what I said previously, that the algorithm used in symtab is my own invention. I have never seen any published algorithm that structured the tree according to frequency of access. All of the binary tree algorithms that I have read about restructure the tree with the idea of balancing the tree. By comparison, symtab trees aren't balanced and it is possible for them to be quite lopsided. If there is no relationship between alphabetic ordering and frequency of access however, then the tree will be pretty much balanced. In regard to a Forth dictionary, the names of the words are chosen for readability and so there shouldn't be any relation. In an associative array, the names might be generated according to some pattern ("1st", "2nd", "3rd", "4th", etc.) and so there would be a relation. I'm not going to worry about it. |
|
|
||||
|
||||
|
|
|
|||
|
Hugh Aguilar wrote:
> Hash tables are popular these days because computers have a lot of > memory. The traditional complaint against hash tables was that they > use too much memory, and therefore are a bad choice for 16-bit > computers. The memory problem is still an issue with 32-bit computers > though. Consider the case in which each wordlist has its own hast > table. In Gforth, we use a single hash table for all wordlists. Different wordlists have a different "seed" into the hash table, i.e. when DUP is in wordlist 0 (the first one), it would go to slot 316 (just an example), if it's in wordlist 3, it would go to slot 319, you get the basic idea (we use XOR). The size of the hash table is the next power of two larger than the total number of entries. That way, you won't have too many collisions, and not waste too much space. If the number of entries reaches the critical value, the hash-table size is doubled and the table re-populated. In case of collisions, later entries are going to be faster, since they get to the head of the chain (that's Forth's search semantics). To avoid slowdowns by too many redefinitions (all of them will end up in the same bucket, and would make access to the tail of the chain slower), we could eliminate duplicates, but the current implementation just doesn't worry about that. I've experimented with AVL tress (balanced binary trees), and the benchmark conclusion was that a) AVL trees are harder to implement b) they are significantly slower than the hashes as implemented above for larger vocabularies (O(log(N)) instead of O(1)) -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/ |
|
|||
|
Hugh Aguilar <hugoaguilar@rosycrew.com> wrote:
> Hash tables are the most popular data structure used for symbol > tables nowadays. These also have faster access for some nodes, and > slower access for other nodes (the nodes that have to be resolved > through linked lists). Sure, but if those linked lists are long then the hash table much too small. Also, you can order those lists by frequencey of access. I would have thought that for symbol tables it makes much more sense to use open addressing. I've always been a bit mystified by the poularity of collision resolution by chaining. > Every hash table is going to be gigantic, even if it only has a few > dozen elements in it. The compiler doesn't know how many elements > are going to be inserted into the hash table, and so it must make > each hash table large enough that it won't overflow in the worst- > case scenario that this will be the one that gets thousands of > elements. Hash tables can be resized. > I have never seen any published algorithm that structured the tree > according to frequency of access. It's in Knuth, Section 6.2.2. Andrew. |
|
|||
|
On Nov 24, 5:07*pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> You had said that the nodes in the tree are sorted according to their > frequency of access. This isn't true; they are sorted alphabetically, > the same as in any binary tree. I had also wrote that (1) I am new to Factor, (2) I only casually reviewed your code, (3) that I didn't find your code's comments clear, and that as a result (4) I might not understand your code. I invited you to explain what I got wrong. You didn't respond. This was 49 days ago. > The symtab tree is *structured* according to frequency of access. Oh, like a splay tree. Why didn't you say so? Probably because you don't know what a splay tree is or the larger class of ordered data structures that adjust themselves based on frequency of access. And not knowing leads to statements from you like the following: > I stand by what I said previously, that the algorithm used in symtab > is my own invention. That may be true. Every day, programmers who are unaware of the work of others often come up with sub-optimal implementations of well-known algorithms. Your algorithm wastes both space (storing) and time (comparing) an explicit count in each node. Splay trees use the actual act of searching the tree to identify which nodes to move; no explicit count is necessary. > I have never seen any published algorithm that > structured the tree according to frequency of access. That may be true. You may not have access to Knuth's third volume of "The Art of Computer Programming" or the title "Self-Adjusting Binary Search Trees" in the original article describing splay trees back in 1985 (Journal of the ACM) might not have piqued your interest. And depending on what keywords you tossed at various search engines, the entire range of academic research into the larger class of data structures (including various kinds of trees) that adjust themselves based on frequency of access could have been hidden from your view. > All of the binary tree algorithms that I have read about restructure > the tree with the idea of balancing the tree. Again, that may be true. I have no idea how much effort you put into a search. |
|
|||
|
On Nov 25, 5:10*am, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote: > I would have thought that for symbol tables it makes much more sense > to use open addressing. *I've always been a bit mystified by the > poularity of collision resolution by chaining. No real mystery: As the load factor of the hash table increases, the cost of probing around to insert or find a key can get quite expensive. In applications where hashes play a critical role (such as many scripting languages that do dynamic lookup of functions and other data), that expense is going to directly impact on the speed of the language. The usual resolution to this is to increase the size of the hash when the load factor crosses some threshold and then rehash everything. But that itself can be expensive. The popularity of chaining probably comes from two things. First, as the load factor of the hash increases, the cost of lookups tends to grow linearly (as opposed to open addressing, where the cost can, depending on the hash quality and probing strategy, grow much faster). Second, by ordering the data in the chains, it's possible to use faster techniques (like skip lists, binary searches, etc.) to reduce the search cost within the chain. In other words, the conceptual simplicity of open addressing comes at the cost of unpredictable access times mixed with the waste of unused space to keep the load factor low. Chaining is more complex, but that complexity pays off in terms of predictability and (potentially) less wasted space, which in some applications matters more. |
|
|||
|
On Nov 25, 12:32*pm, John Passaniti <john.passan...@gmail.com> wrote:
> On Nov 24, 5:07*pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote: > > The symtab tree is *structured* according to frequency of access. > > Oh, like a splay tree. *Why didn't you say so? *Probably because you > don't know what a splay tree is or the larger class of ordered data > structures that adjust themselves based on frequency of access. *And > not knowing leads to statements from you like the following: > > > I stand by what I said previously, that the algorithm used in symtab > > is my own invention. > > That may be true. *Every day, programmers who are unaware of the work > of others often come up with sub-optimal implementations of well-known > algorithms. Nobody here has shown me any evidence to indicate that my symtab is sub-optimal. > Your algorithm wastes both space (storing) and time (comparing) an > explicit count in each node. *Splay trees use the actual act of > searching the tree to identify which nodes to move; no explicit count > is necessary. If splay trees don't store a count the way that my symtab trees do, then they aren't going to be as efficient, because they have less information to work with. It seems unrealistic that any algorithm is going to minimize both space and time. More likely, there is a trade- off. These splay trees minimize space (no count field) at a cost in time, whereas my symtab trees minimize time (they are optimal) at a cost in space (a count field). Also, I don't think that comparing integers takes up much time. I'll do some research on these splay-trees and get back to you with a comparison. I still feel confident that, whatever splay-trees are, my symtab will be more efficient. If splay trees don't have a count field, then they have only a vague guess of how many times each node has been accessed. How much information can be stored in zero bits? I'm not an expert on information theory, but I think the answer would be: "None." I don't consider the memory usage of the count field to be much of an issue. Even with an extra word per node, symtab is still going to be more memory efficient than hash tables with their big sparse array. Also, a symtab tree is going to be more robust because it grows dynamically. To resize a hash table you have to reinsert all of the elements. That is very slow. It is true that I have my OPTIMIZE function that traverses the tree and optimizes it, but OPTIMIZE is quite fast. Also, OPTIMIZE is only necessary when the programmer is using a large DAMPER value. If you set the DAMPER value low you get more churning, which slows down the operation of the tree, but you don't have to run OPTIMIZE as often (maybe not at all). |
|
|||
|
John Passaniti <john.passaniti@gmail.com> wrote:
> On Nov 25, 5:10?am, Andrew Haley <andre...@littlepinkcloud.invalid> > wrote: > > I would have thought that for symbol tables it makes much more > > sense to use open addressing. I've always been a bit mystified by > > the poularity of collision resolution by chaining. > No real mystery: As the load factor of the hash table increases, the > cost of probing around to insert or find a key can get quite > expensive. Let me explain a little more more what I meant. Sometimes you see simple applications of hash tables that are not really critical to application performance, where the programmer has gone through the rigmarole of resolution by chaining, for additional code complexity and no real advantage. This contradicts the basic rule of "start by doing something simple, measure the results, and only do anything more complicated if performance really justifies it." > In applications where hashes play a critical role (such as many > scripting languages that do dynamic lookup of functions and other > data), that expense is going to directly impact on the speed of the > language. The usual resolution to this is to increase the size of > the hash when the load factor crosses some threshold and then rehash > everything. But that itself can be expensive. Well, you have to do that for any hash table. On the one hand you have the simplicity of open addressing, but its downside is that you mustn't allow the hash table ever to become more than about a third to half full. Even when a table is half full there are only on average 1.5 probes for a successful search and 2.5 probes for an unsuccessful one. > The popularity of chaining probably comes from two things. First, > as the load factor of the hash increases, the cost of lookups tends > to grow linearly (as opposed to open addressing, where the cost can, > depending on the hash quality and probing strategy, grow much > faster). Well, I'm assuming here a hash that behaves like a random function. If not, all bets are off, but I'm not at all convinced by the strategy of collision resolution by chaining as a way to mitigate a bad hash function! > Second, by ordering the data in the chains, it's possible to use > faster techniques (like skip lists, binary searches, etc.) to reduce > the search cost within the chain. > In other words, the conceptual simplicity of open addressing comes > at the cost of unpredictable access times mixed with the waste of > unused space to keep the load factor low. Chaining is more complex, > but that complexity pays off in terms of predictability and > (potentially) less wasted space, which in some applications matters > more. I don't agree about predictability: as long as the table is kept reasonably empty, there's no difference in predictability between the two approaches. I accept that If you want a really dense hash table in memory and you're prepared to put up with the extra complexity, there may be something to be said for chaining. Andrew. |
|
|||
|
Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>Sometimes you see >simple applications of hash tables that are not really critical to >application performance, where the programmer has gone through the >rigmarole of resolution by chaining, for additional code complexity >and no real advantage. Rigmarole? Additional code complexity? I find a hash table with external chaining to be very simple to implement: it's a little demuxing code tacked in front of a linked-list implementation. In contrast, with open addressing the implementation is much more involved: The text books say that I should compute another, different hash function for dealing with the collisions, and I have to deal with the possibility of having more entries than slots. With chaining having a higher load factor results in degraded performance, with open addressing it results in failure; so I have to implement table overflow handling even in cases where a simple fixed-size table would be good enough with chaining. >This contradicts the basic rule of "start by >doing something simple, measure the results, and only do anything more >complicated if performance really justifies it." Yes, that's why I have always stuck to chaining. >I accept that If you want a really dense hash table >in memory and you're prepared to put up with the extra complexity, >there may be something to be said for chaining. I would have said exactly the same thing about open addressing. With chaining, I always waste as many pointers as the table contains. With that many free slots, open addressing can do quite well; the complexity is in the implementation. - anton -- M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html New standard: http://www.forth200x.org/forth200x.html EuroForth 2009: http://www.euroforth.org/ef09/ |
|
|||
|
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andrew29@littlepinkcloud.invalid> writes: > >Sometimes you see simple applications of hash tables that are not > >really critical to application performance, where the programmer > >has gone through the rigmarole of resolution by chaining, for > >additional code complexity and no real advantage. > Rigmarole? Additional code complexity? I find a hash table with > external chaining to be very simple to implement: it's a little > demuxing code tacked in front of a linked-list implementation. Well, yeah, as opposed to a little demuxing code. > In contrast, with open addressing the implementation is much more > involved: The text books say that I should compute another, different > hash function for dealing with the collisions, Double hashing is only one option. > and I have to deal with the possibility of having more entries than > slots. As I said previously, I assume that you're going to have to resize eventually, no matter what kind of hash table you use. If that assumption doesn't hold, then you will get into a situation where every slot is full, and then resolution by chaining is the only possibility. Clearly in that situation you can't possibly use open addressing. Andrew. |
|
|||
|
Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote: >> Andrew Haley <andrew29@littlepinkcloud.invalid> writes: > >> >Sometimes you see simple applications of hash tables that are not >> >really critical to application performance, where the programmer >> >has gone through the rigmarole of resolution by chaining, for >> >additional code complexity and no real advantage. > >> Rigmarole? Additional code complexity? I find a hash table with >> external chaining to be very simple to implement: it's a little >> demuxing code tacked in front of a linked-list implementation. > >Well, yeah, as opposed to a little demuxing code. Plus a little collision handling code (both on lookup and insert; only on lookup with chaining); plus a little code for resizing. - anton -- M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html New standard: http://www.forth200x.org/forth200x.html EuroForth 2009: http://www.euroforth.org/ef09/ |
|
|||
|
Anton Ertl wrote:
> In contrast, with open addressing the implementation is much more > involved: The text books say that I should compute another, different > hash function for dealing with the collisions, and I have to deal with > the possibility of having more entries than slots. With chaining > having a higher load factor results in degraded performance, with open > addressing it results in failure; so I have to implement table > overflow handling even in cases where a simple fixed-size table would > be good enough with chaining. Another note: Gforth's way to deal with multiple vocabularies in just one shared hash table won't work with open addressing. And our method is much more practical with the grow-your-hash-table-when-needed approach: Since we have only one hash table, it's already large, and doesn't have to grow often. -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/ |
|
|||
|
Bernd Paysan <bernd.paysan@gmx.de> writes:
>Another note: Gforth's way to deal with multiple vocabularies in just one >shared hash table won't work with open addressing. Why not? Because our method relies on separate buckets staying separate, whereas in open addressing you jump to another bucket if the first is full. One could work around that by storying the wordlist number with each word, but that costs extra memory. Another advantage of chaining in the context of Forth is that it provides easy and natural shadowing of older definitions with the same name, because it keeps this property of linked lists. However, IIRC we don't really make use of that in Gforth, because we always rebuild the hash table when executing a marker. >And our method is much >more practical with the grow-your-hash-table-when-needed approach: Since we >have only one hash table, it's already large, and doesn't have to grow >often. But when it grows, that's much more expensive. Still, if we used separate hash tables, when somebody does 10 wordlists, each with 100 words, 640 of these 1000 words will have gone through one doubling, 320 of these 640 through two doublings, etc.; in contrast, because we supersized the hash table from the start (one doubling beyond what the algorithm would do by itself), none of the words would have to go through any doubling with our scheme. Another advantage is that there is less meta-data overhead: no need to store a pointer, table size and population size for every wordlist. - anton -- M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html New standard: http://www.forth200x.org/forth200x.html EuroForth 2009: http://www.euroforth.org/ef09/ |
|
|||
|
On Nov 26, 2:19 am, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> Nobody here has shown me any evidence to indicate that my symtab is > sub-optimal. At best, the only thing you're able to claim with regard to optimality is that at any particular moment in time, the nodes are ideally arranged given some *past* cumulative frequency of access. That isn't a very useful measurement. Nobody cares about the optimality of the tree *after* some number of symbols have been accessed. What matters are the basic statistical measures (sum, mean, minimum, maximum, standard deviation) of the access times leading up to that point (and the same statistics afterward). If your measure of optimality is how accurately the tree's shape reflects the frequency distribution of symbols, then congratulations-- you've apparently solved a problem that nobody cares about. You're also putting a tremendous amount of faith that the cumulative frequency distribution at any point in time actually matters. What seems more likely to me is that one cares far less about the overall frequency distribution than the temporal pattern of access. Say I have a symbol accessed more than any other during a run, but only near the start of the run. Your scheme will promote that symbol to the root, where it will uselessly remain there during the remainder of the run. How is that helpful? How is that adaptive? > If splay trees don't store a count the way that my symtab trees do, > then they aren't going to be as efficient, because they have less > information to work with. The insight that you're missing is that searching the tree is itself information. It's the identification of a "hot" path of pointers to a desired node. No count is necessary to store or later compare. You are using a singular definition of efficiency based on the tree's shape accurately reflecting the frequency distribution of symbols. Ignoring for the moment that I question the very premise of that being useful in the kinds of applications one typically wants a symbol table for, it completely ignores the cost of restructuring the tree. Let's compare: 1) When some threshold is reached, a symtab decides to stop the world and restructure the tree. This requires iterating over the *entire* tree, juggling nodes around. As the tree grows, that is a increasingly expensive operation in terms of processor time. 2) A splay tree optimizes only along the "hot" path of pointers to a node when searching. The rest of the tree is unaffected and retains the shape that past patterns of access left in those untouched parts of the tree. The end result is that splay trees do less work than your symtab and that work done is amortized over time. In some variations of splay trees, the work done is even less because the searched node isn't moved to the root, but just promoted closer to it. > It seems unrealistic that any algorithm is > going to minimize both space and time. More likely, there is a trade- > off. These splay trees minimize space (no count field) at a cost in > time, whereas my symtab trees minimize time (they are optimal) at a > cost in space (a count field). Again, you've optimized for something nobody cares about. What good is having an optimal tree structure in memory when you've just finished searching the tree and the program is ending? What good is having an optimal tree structure based on frequency if the frequency of symbols no longer represents the temporal pattern of access? Splay trees make no claim (and thus, put forth no effort) in ensuring that the tree at any point in time is globally optimal. Instead, splay trees are concerned with a looser local, temporal optimality. And they do so amortizing the cost over time, making access times more evenly distributed than your "hold on while I stop the world and restructure the whole tree" strategy. > Also, I don't think that comparing integers takes up much time. Not comparing integers takes up even less time. And less code. And less complexity. And less opportunity for error. And no need for code that tests to see if count exceeds a fixnum and then iterates over the entire tree, dividing the count by two. > I'll do some research on these splay-trees and get back to you with a > comparison. I still feel confident that, whatever splay-trees are, my > symtab will be more efficient. If splay trees don't have a count > field, then they have only a vague guess of how many times each node > has been accessed. How much information can be stored in zero bits? The information is stored in the path of pointers leading to the node. > I'm not an expert on information theory, but I think the answer would > be: "None." Yes, it's clear you aren't an expert on information theory. It's probably because you have a rather limited definition of "information." > I don't consider the memory usage of the count field to be much of an > issue. Even with an extra word per node, symtab is still going to be > more memory efficient than hash tables with their big sparse array. Most real-world implementations of hash tables start small and then grow as necessary. > Also, a symtab tree is going to be more robust because it grows > dynamically. To resize a hash table you have to reinsert all of the > elements. False. Many hash table implementation do indeed reinsert all the elements because that is simple. But it isn't necessary. Linear hashing is a technique that allows for growing a hash table one node at a time, and it only requires rehashing the elements in a single collision chain. Other techniques like "consistent hashing" works by partitioning the storage space, and only requires reinsertion of the elements that fall in that space. > That is very slow. False. Hash tables are used in real-time situations where consistent speed matters (and where your symtab would fail). This is done by amortizing the cost of reinserting nodes of over time. One common technique to resizing a hash table that has real-time constraints is to allocate a new empty table, and then use it for primary searches, falling back to the original table when searches fail. New insertions occur in the new table, and nodes are incrementally moved from the original to the new table, bound to an upper limit (such as number of nodes or an interval of time). When finished, you delete the original table. > It is true that I have my OPTIMIZE > function that traverses the tree and optimizes it, but OPTIMIZE is > quite fast. Also, OPTIMIZE is only necessary when the programmer is > using a large DAMPER value. If you set the DAMPER value low you get > more churning, which slows down the operation of the tree, but you > don't have to run OPTIMIZE as often (maybe not at all). Quite fast? I'm sorry, I'm unfamiliar with that unit of measure. Help me convert that to processor cycles per node so that I can meaningfully decide if what you think is "quite fast" is suitable for any particular application? Thanks. Here's your homework. It has six parts. This really shouldn't take much time: 1) Your data set at the end of your code is cute. Let's instead pick something more useful and representative. You call this a symbol table, so let's throw real-world symbols at it. Let's take the symbols from source code. Factor is kind of cool, so let's use it. When I look at the current distribution for Linux, I see there are 3177 files named "*.factor". I want you to iterate through those files and put the symbols you find in those files in your symtab. (I'll define a symbol as whitespace-separated tokens for ease of parsing.) Ideally, you'll eliminate comments first, but maybe that doesn't matter. Regardless, let's roughly estimate how many symbols there are: $ find -name "*.factor" | xargs cat | awk '{ for(i=1;i<NF;i++) print $i;}' | wc -w 1228623 Cool, roughly 1.2M symbols. I wonder how many of those are unique? $ find -name "*.factor" | xargs cat | awk '{for(i=1;i<=NF;i++) print $i;}' | sort -u | wc -w 106714 So we're talking in the neighborhood of 100k unique symbols to store in your symtab. That should tell us something. 2) Instrument your code so that you record the time taken to insert/ access symbols in the table. I don't know Factor, but it probably has a library routine to look at the system's high-resolution timer (or better, the performance counters in your CPU). Use the best timer you have to measure the insert/access time. Collect simple statistics; the sum, mean, minimum, and maximum. 3) Comment-out the timer additions from the previous step and replace it with a count of the number of probes it took to find the symbol. So if a symbol was found at the root, that's 1. If it was found down one level, that's 2. And so on. Again, collect simple statistics. 4) Report your numbers. Include the value(s) you used for DAMPER. Okay, that will be interesting. And now it's time for the *really* fun part: 5) Change the code to replace and increment of "count" when inserted/ accessed to storing the current time (hopefully with millisecond resolution, but okay if not). Keep the rest of the code the same. In other words, we're going to completely ignore frequency and instead reorder the tree based in terms of when a field was last used. 6) Repeat steps 2 through 4, above. And here's my fearless prediction: Getting rid of the frequency counting will reduce the average number of probes to find a symbol, probably dramatically. That's because you want more of a cache and less of a statistical model of your symbols. I don't know if this will result in a reduction of the average access time to find symbols; that time could potentially be dominated by the OPTIMIZE routine stopping the world and restructuring nodes. Have fun. |
|
|||
|
On Nov 27, 12:57 am, John Passaniti <john.passan...@gmail.com> wrote:
> You're also putting a tremendous amount of faith that the cumulative > frequency distribution at any point in time actually matters. What > seems more likely to me is that one cares far less about the overall > frequency distribution than the temporal pattern of access. Say I > have a symbol accessed more than any other during a run, but only near > the start of the run. Your scheme will promote that symbol to the > root, where it will uselessly remain there during the remainder of the > run. How is that helpful? How is that adaptive? My symtab tree restructures itself to be optimal for a program that is getting recompiled repeatedly. There is some churning in the beginning of the development of that program. Pretty soon though, the tree settles down into an optimal state and there is very little churning, which mostly occurs as the programmer writes new words and these start out at a leaf and work their way up to an appropriate place in the tree. This "temporal pattern of access" doesn't make sense to me. You seem to imply that the entire tree gets restructured during the course of compilation of a program. That is a lot of churning! > You are using a singular definition of efficiency based on the tree's > shape accurately reflecting the frequency distribution of symbols. > Ignoring for the moment that I question the very premise of that being > useful in the kinds of applications one typically wants a symbol table > for, it completely ignores the cost of restructuring the tree. I'm not "completely ignoring the cost of restructuring the tree." As I said above, the tree gradually settles down into an optimum state and there is less and less churning needed. > 1) When some threshold is reached, a symtab decides to stop the world > and restructure the tree. This requires iterating over the *entire* > tree, juggling nodes around. As the tree grows, that is a > increasingly expensive operation in terms of processor time. I have no idea what you are talking about here; there is no threshold that triggers anything to happen. My symtab never "stops the world" and restructures the entire tree. I do have my OPTIMIZE function that restructures the entire tree, but its use is optional. It is possible to set DAMPER to a value > 0 (the default is 30), in which case the tree is sub-optimal. The higher the DAMPER value, the further it is from optimality, and the lower the DAMPER value the closer the tree is to optimality, with a DAMPER value of zero providing perfect optimality. The use of OPTIMIZE is always optional though. I recommend that if you have a high DAMPER value, you should call OPTIMIZE occasionally. A good time to do this is at the end of the day when you are shutting down the system. It will "stop the world" and restructure the entire tree, but you have already gone home at this time. I do have my PRUNE function that removes all of the smudged nodes from the leaves. This is also optional. PRUNE is more important than OPTIMIZE though, as it saves memory by getting rid of all of those smudged nodes. It also speeds up the tree because searches through the tree won't have to go through all of those smudged nodes. PRUNE also halves all of the count values in the tree if they have gotten too big. This is important. If you never call PRUNE, the count values will eventually overflow the single-precision integers that they are stored in. This may take several months to happen, but it will happen eventually, so you do want to call PRUNE occasionally --- when you are going home at the end of the day is a good time. Fortunately, PRUNE is extremely fast because it doesn't move nodes around at all --- it just traverses the tree. Your comments above, about stopping the world to restructure the entire tree, and about some kind of threshold, imply that you don't understand how symtab works. Study the code --- you'll figure it out eventually. > 1) Your data set at the end of your code is cute. Let's instead pick > something more useful and representative. You call this a symbol > table, so let's throw real-world symbols at it. Let's take the > symbols from source code. Factor is kind of cool, so let's use it. Symtab normally initializes each node with a count value of 1. If you have a-priori knowledge of what symbols are going to be used, you can initialize each node with a count value that you provide. Assuming that your a-priori knowledge is a good predictor of the future, this will get the symtab tree into an optimal structure a lot sooner than if every node starts out at 1. In either case though, the tree will eventually become optimal --- it just gets there sooner with some help up front. I read in Forth Dimensions a long time ago that somebody did some statistics over a large number of Forth programs to determine which are the most commonly used. This was way back in the days of linked lists, and the idea was to structure the linked list by order of frequency, rather than by the order that the words happened to be defined. If symtab were to be used for a Forth dictionary, collecting statistics like this would be a good idea so that the dictionary would be pretty close to optimal out-of-the-box, assuming that the user's programs will be similar to the programs written by other Forth programmers. The programmer might be different than everybody else. He might, for example, write a lot of programs involving floating-point arithmetic, whereas the programs that we collected statistics on might not have used floating-point very much. In this case, there will be some churning in the early days of the guy's usage of the compiler as the floating-point words burble upwards in the tree. Pretty soon though, the tree will have adapted itself its new owner's programming style and will be optimal for him (assuming that his programming style doesn't change over time). |
|
|||
|
On Nov 24, 3:34*pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Hugh Aguilar wrote: > > Consider the case in which each wordlist has its own hast > > table. > > In Gforth, we use a single hash table for all wordlists. *Different > wordlists have a different "seed" into the hash table, i.e. when DUP is > in wordlist 0 (the first one), it would go to slot 316 (just an > example), if it's in wordlist 3, it would go to slot 319, you get the > basic idea (we use XOR). *The size of the hash table is the next power > of two larger than the total number of entries. *That way, you won't > have too many collisions, and not waste too much space. *If the number > of entries reaches the critical value, the hash-table size is doubled > and the table re-populated. UR/Forth did the same thing in its single hash tree; it prepended the voc# on the front of each word's name. It is a lot more efficient to put each word-list in its own data structure however. In the case of my symtab, this means that we avoid comparing our search string to a large number of nodes that are in other word-lists and therefore aren't going to match. In the case of a hash table, it means a lot fewer collisions. UR/Forth pretty much had to use a single hash table because the 16-bit 8086 uses segmented memory and so there was only one realistic size for a hash table (64K). That business of resizing and repopulating the hash tree is very slow. Also, this happens at unpredictable times and can cause the system to freeze up unexpectedly. All in all, I think that my symtab is much more robust than any hash table. > I've experimented with AVL tress (balanced binary trees), and the > benchmark conclusion was that > > a) AVL trees are harder to implement I agree that AVL trees are hard to implement, and my experience is that symtab trees are even harder to implement. Fortunately, I've already done the heavy lifting, so it should be fairly easy for anybody to integrate my symtab code into his compiler. > b) they are significantly slower than the hashes as implemented above > for larger vocabularies (O(log(N)) instead of O(1)) I agree that AVL trees are slow, but this doesn't have anything to do with symtab trees. |
|
|
![]() |
| Popular Tags in the Forum |
| symtab |
| Thread Tools | |
| Display Modes | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Factor | Hugh Aguilar | Newsgroup comp.lang.forth | 15 | 10-07-2009 09:31 PM |