Go Back   Rhinocerus > Newsgroup > Newsgroup comp.lang.* 1 > Newsgroup comp.lang.forth

Reply
 
Thread Tools Display Modes
  #1 (permalink)  
Old 11-24-2009, 09:07 PM
Hugh Aguilar
Guest
 
Posts: n/a
Default symtab

On Nov 23, 11:59 pm, John Passaniti <john.passan...@gmail.com> wrote:
> On Nov 22, 11:23 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
>
> > I recently ported my symtab.factor program to Forth (and fixed a bug
> > in it). You couldn't figure out symtab last time I posted it, but
> > maybe you'll get it this time. Here is the Factor version again:

>
> >http://www.rosycrew.com/symtab.factor

>
> I would prefer that you revert the code back to what I casually
> reviewed and *then* tell me precisely what I got wrong. Note the word
> "precisely"


Here is a Factor program:
www.rosycrew.com/symtab.factor
Here is a Forth program:
www.rosycrew.com/symtab.4th
www.rosycrew.com/novice.4th

The novice.4th program contains some support functions needed by
symtab.4th. The novice.4th program isn't complete; I am adding new
functions to it and will later make it available to novices to help
them get started. What we have here is just a step in that direction.

You had said that the nodes in the tree are sorted according to their
frequency of access. This isn't true; they are sorted alphabetically,
the same as in any binary tree. The symtab tree is *structured*
according to frequency of access. The most frequently accessed node is
the root, the next-most frequently accessed nodes are its children,
and so on down to the leaves, which are the least frequently accessed
nodes. The idea is to provide fast access to words such as DUP that
are used a lot, and slow access to words such as QUIT and BYE that are
used less. The result should be an overall improvement in speed.

Hash tables are the most popular data structure used for symbol tables
nowadays. These also have faster access for some nodes, and slower
access for other nodes (the nodes that have to be resolved through
linked lists). The problem is that *which* nodes are fast or slow is
pretty much a matter of chance, although the earlier-defined nodes
tend to be faster than the more recent ones. Bad luck might give you
slow access on DUP and fast access on BYE.

Hash tables are popular these days because computers have a lot of
memory. The traditional complaint against hash tables was that they
use too much memory, and therefore are a bad choice for 16-bit
computers. The memory problem is still an issue with 32-bit computers
though. Consider the case in which each wordlist has its own hast
table. Also consider the case in which the programmer uses associative
arrays (as done in Lua and AWK and various other high-level
languages). Every hash table is going to be gigantic, even if it only
has a few dozen elements in it. The compiler doesn't know how many
elements are going to be inserted into the hash table, and so it must
make each hash table large enough that it won't overflow in the worst-
case scenario that this will be the one that gets thousands of
elements. By comparison, my symtab data structure doesn't have any up-
front overhead. It does have more per-node overhead though, as each
node has a left and right pointer field, and a count field. In symtab,
the size of the data structure is linearly related to the number of
elements. The same amount of memory is going to be used whether you
have one symtab with X elements in it, or N symtabs with an average of
X/N elements in each.

No, I'm not going to reintroduce a bug that I have already fixed. Why
would I want to do that? It was a pretty obscure bug that only came up
when a node was inserted into the tree with the same key as an
existing node. The bug was that I smudged the new node rather than the
old node. In a Forth dictionary, the old node is supposed to be
smudged and the new node is supposed to be inserted into the
dictionary, typically with a "redefinition" warning. Fixing this bug
didn't change the symtab algorithm.

I stand by what I said previously, that the algorithm used in symtab
is my own invention. I have never seen any published algorithm that
structured the tree according to frequency of access.

All of the binary tree algorithms that I have read about restructure
the tree with the idea of balancing the tree. By comparison, symtab
trees aren't balanced and it is possible for them to be quite
lopsided. If there is no relationship between alphabetic ordering and
frequency of access however, then the tree will be pretty much
balanced. In regard to a Forth dictionary, the names of the words are
chosen for readability and so there shouldn't be any relation. In an
associative array, the names might be generated according to some
pattern ("1st", "2nd", "3rd", "4th", etc.) and so there would be a
relation. I'm not going to worry about it.
Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

  #2 (permalink)  
Old 11-24-2009, 09:34 PM
Bernd Paysan
Guest
 
Posts: n/a
Default Re: symtab

Hugh Aguilar wrote:
> Hash tables are popular these days because computers have a lot of
> memory. The traditional complaint against hash tables was that they
> use too much memory, and therefore are a bad choice for 16-bit
> computers. The memory problem is still an issue with 32-bit computers
> though. Consider the case in which each wordlist has its own hast
> table.


In Gforth, we use a single hash table for all wordlists. Different
wordlists have a different "seed" into the hash table, i.e. when DUP is
in wordlist 0 (the first one), it would go to slot 316 (just an
example), if it's in wordlist 3, it would go to slot 319, you get the
basic idea (we use XOR). The size of the hash table is the next power
of two larger than the total number of entries. That way, you won't
have too many collisions, and not waste too much space. If the number
of entries reaches the critical value, the hash-table size is doubled
and the table re-populated.

In case of collisions, later entries are going to be faster, since they
get to the head of the chain (that's Forth's search semantics). To
avoid slowdowns by too many redefinitions (all of them will end up in
the same bucket, and would make access to the tail of the chain slower),
we could eliminate duplicates, but the current implementation just
doesn't worry about that.

I've experimented with AVL tress (balanced binary trees), and the
benchmark conclusion was that

a) AVL trees are harder to implement
b) they are significantly slower than the hashes as implemented above
for larger vocabularies (O(log(N)) instead of O(1))

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Reply With Quote
  #3 (permalink)  
Old 11-25-2009, 09:10 AM
Andrew Haley
Guest
 
Posts: n/a
Default Re: symtab

Hugh Aguilar <hugoaguilar@rosycrew.com> wrote:

> Hash tables are the most popular data structure used for symbol
> tables nowadays. These also have faster access for some nodes, and
> slower access for other nodes (the nodes that have to be resolved
> through linked lists).


Sure, but if those linked lists are long then the hash table much too
small. Also, you can order those lists by frequencey of access.

I would have thought that for symbol tables it makes much more sense
to use open addressing. I've always been a bit mystified by the
poularity of collision resolution by chaining.

> Every hash table is going to be gigantic, even if it only has a few
> dozen elements in it. The compiler doesn't know how many elements
> are going to be inserted into the hash table, and so it must make
> each hash table large enough that it won't overflow in the worst-
> case scenario that this will be the one that gets thousands of
> elements.


Hash tables can be resized.

> I have never seen any published algorithm that structured the tree
> according to frequency of access.


It's in Knuth, Section 6.2.2.

Andrew.
Reply With Quote
  #4 (permalink)  
Old 11-25-2009, 06:32 PM
John Passaniti
Guest
 
Posts: n/a
Default Re: symtab

On Nov 24, 5:07*pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> You had said that the nodes in the tree are sorted according to their
> frequency of access. This isn't true; they are sorted alphabetically,
> the same as in any binary tree.


I had also wrote that (1) I am new to Factor, (2) I only casually
reviewed your code, (3) that I didn't find your code's comments clear,
and that as a result (4) I might not understand your code. I invited
you to explain what I got wrong. You didn't respond. This was 49
days ago.

> The symtab tree is *structured* according to frequency of access.


Oh, like a splay tree. Why didn't you say so? Probably because you
don't know what a splay tree is or the larger class of ordered data
structures that adjust themselves based on frequency of access. And
not knowing leads to statements from you like the following:

> I stand by what I said previously, that the algorithm used in symtab
> is my own invention.


That may be true. Every day, programmers who are unaware of the work
of others often come up with sub-optimal implementations of well-known
algorithms.

Your algorithm wastes both space (storing) and time (comparing) an
explicit count in each node. Splay trees use the actual act of
searching the tree to identify which nodes to move; no explicit count
is necessary.

> I have never seen any published algorithm that
> structured the tree according to frequency of access.


That may be true. You may not have access to Knuth's third volume of
"The Art of Computer Programming" or the title "Self-Adjusting Binary
Search Trees" in the original article describing splay trees back in
1985 (Journal of the ACM) might not have piqued your interest. And
depending on what keywords you tossed at various search engines, the
entire range of academic research into the larger class of data
structures (including various kinds of trees) that adjust themselves
based on frequency of access could have been hidden from your view.

> All of the binary tree algorithms that I have read about restructure
> the tree with the idea of balancing the tree.


Again, that may be true. I have no idea how much effort you put into
a search.



Reply With Quote
  #5 (permalink)  
Old 11-25-2009, 07:32 PM
John Passaniti
Guest
 
Posts: n/a
Default Re: symtab

On Nov 25, 5:10*am, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:
> I would have thought that for symbol tables it makes much more sense
> to use open addressing. *I've always been a bit mystified by the
> poularity of collision resolution by chaining.


No real mystery: As the load factor of the hash table increases, the
cost of probing around to insert or find a key can get quite
expensive. In applications where hashes play a critical role (such as
many scripting languages that do dynamic lookup of functions and other
data), that expense is going to directly impact on the speed of the
language. The usual resolution to this is to increase the size of the
hash when the load factor crosses some threshold and then rehash
everything. But that itself can be expensive.

The popularity of chaining probably comes from two things. First, as
the load factor of the hash increases, the cost of lookups tends to
grow linearly (as opposed to open addressing, where the cost can,
depending on the hash quality and probing strategy, grow much
faster). Second, by ordering the data in the chains, it's possible to
use faster techniques (like skip lists, binary searches, etc.) to
reduce the search cost within the chain.

In other words, the conceptual simplicity of open addressing comes at
the cost of unpredictable access times mixed with the waste of unused
space to keep the load factor low. Chaining is more complex, but that
complexity pays off in terms of predictability and (potentially) less
wasted space, which in some applications matters more.

Reply With Quote
  #6 (permalink)  
Old 11-26-2009, 06:19 AM
Hugh Aguilar
Guest
 
Posts: n/a
Default Re: symtab

On Nov 25, 12:32*pm, John Passaniti <john.passan...@gmail.com> wrote:
> On Nov 24, 5:07*pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> > The symtab tree is *structured* according to frequency of access.

>
> Oh, like a splay tree. *Why didn't you say so? *Probably because you
> don't know what a splay tree is or the larger class of ordered data
> structures that adjust themselves based on frequency of access. *And
> not knowing leads to statements from you like the following:
>
> > I stand by what I said previously, that the algorithm used in symtab
> > is my own invention.

>
> That may be true. *Every day, programmers who are unaware of the work
> of others often come up with sub-optimal implementations of well-known
> algorithms.


Nobody here has shown me any evidence to indicate that my symtab is
sub-optimal.

> Your algorithm wastes both space (storing) and time (comparing) an
> explicit count in each node. *Splay trees use the actual act of
> searching the tree to identify which nodes to move; no explicit count
> is necessary.


If splay trees don't store a count the way that my symtab trees do,
then they aren't going to be as efficient, because they have less
information to work with. It seems unrealistic that any algorithm is
going to minimize both space and time. More likely, there is a trade-
off. These splay trees minimize space (no count field) at a cost in
time, whereas my symtab trees minimize time (they are optimal) at a
cost in space (a count field). Also, I don't think that comparing
integers takes up much time.

I'll do some research on these splay-trees and get back to you with a
comparison. I still feel confident that, whatever splay-trees are, my
symtab will be more efficient. If splay trees don't have a count
field, then they have only a vague guess of how many times each node
has been accessed. How much information can be stored in zero bits?
I'm not an expert on information theory, but I think the answer would
be: "None."

I don't consider the memory usage of the count field to be much of an
issue. Even with an extra word per node, symtab is still going to be
more memory efficient than hash tables with their big sparse array.
Also, a symtab tree is going to be more robust because it grows
dynamically. To resize a hash table you have to reinsert all of the
elements. That is very slow. It is true that I have my OPTIMIZE
function that traverses the tree and optimizes it, but OPTIMIZE is
quite fast. Also, OPTIMIZE is only necessary when the programmer is
using a large DAMPER value. If you set the DAMPER value low you get
more churning, which slows down the operation of the tree, but you
don't have to run OPTIMIZE as often (maybe not at all).
Reply With Quote
  #7 (permalink)  
Old 11-26-2009, 09:42 AM
Andrew Haley
Guest
 
Posts: n/a
Default Re: symtab

John Passaniti <john.passaniti@gmail.com> wrote:
> On Nov 25, 5:10?am, Andrew Haley <andre...@littlepinkcloud.invalid>
> wrote:


> > I would have thought that for symbol tables it makes much more
> > sense to use open addressing. I've always been a bit mystified by
> > the poularity of collision resolution by chaining.


> No real mystery: As the load factor of the hash table increases, the
> cost of probing around to insert or find a key can get quite
> expensive.


Let me explain a little more more what I meant. Sometimes you see
simple applications of hash tables that are not really critical to
application performance, where the programmer has gone through the
rigmarole of resolution by chaining, for additional code complexity
and no real advantage. This contradicts the basic rule of "start by
doing something simple, measure the results, and only do anything more
complicated if performance really justifies it."

> In applications where hashes play a critical role (such as many
> scripting languages that do dynamic lookup of functions and other
> data), that expense is going to directly impact on the speed of the
> language. The usual resolution to this is to increase the size of
> the hash when the load factor crosses some threshold and then rehash
> everything. But that itself can be expensive.


Well, you have to do that for any hash table.

On the one hand you have the simplicity of open addressing, but its
downside is that you mustn't allow the hash table ever to become more
than about a third to half full. Even when a table is half full there
are only on average 1.5 probes for a successful search and 2.5 probes
for an unsuccessful one.

> The popularity of chaining probably comes from two things. First,
> as the load factor of the hash increases, the cost of lookups tends
> to grow linearly (as opposed to open addressing, where the cost can,
> depending on the hash quality and probing strategy, grow much
> faster).


Well, I'm assuming here a hash that behaves like a random function.
If not, all bets are off, but I'm not at all convinced by the strategy
of collision resolution by chaining as a way to mitigate a bad hash
function!

> Second, by ordering the data in the chains, it's possible to use
> faster techniques (like skip lists, binary searches, etc.) to reduce
> the search cost within the chain.


> In other words, the conceptual simplicity of open addressing comes
> at the cost of unpredictable access times mixed with the waste of
> unused space to keep the load factor low. Chaining is more complex,
> but that complexity pays off in terms of predictability and
> (potentially) less wasted space, which in some applications matters
> more.


I don't agree about predictability: as long as the table is kept
reasonably empty, there's no difference in predictability between the
two approaches. I accept that If you want a really dense hash table
in memory and you're prepared to put up with the extra complexity,
there may be something to be said for chaining.

Andrew.
Reply With Quote
  #8 (permalink)  
Old 11-26-2009, 10:33 AM
Anton Ertl
Guest
 
Posts: n/a
Default hash tables (was: symtab)

Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>Sometimes you see
>simple applications of hash tables that are not really critical to
>application performance, where the programmer has gone through the
>rigmarole of resolution by chaining, for additional code complexity
>and no real advantage.


Rigmarole? Additional code complexity? I find a hash table with
external chaining to be very simple to implement: it's a little
demuxing code tacked in front of a linked-list implementation.

In contrast, with open addressing the implementation is much more
involved: The text books say that I should compute another, different
hash function for dealing with the collisions, and I have to deal with
the possibility of having more entries than slots. With chaining
having a higher load factor results in degraded performance, with open
addressing it results in failure; so I have to implement table
overflow handling even in cases where a simple fixed-size table would
be good enough with chaining.

>This contradicts the basic rule of "start by
>doing something simple, measure the results, and only do anything more
>complicated if performance really justifies it."


Yes, that's why I have always stuck to chaining.

>I accept that If you want a really dense hash table
>in memory and you're prepared to put up with the extra complexity,
>there may be something to be said for chaining.


I would have said exactly the same thing about open addressing. With
chaining, I always waste as many pointers as the table contains. With
that many free slots, open addressing can do quite well; the
complexity is in the implementation.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/
Reply With Quote
  #9 (permalink)  
Old 11-26-2009, 01:40 PM
Andrew Haley
Guest
 
Posts: n/a
Default Re: hash tables

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andrew29@littlepinkcloud.invalid> writes:


> >Sometimes you see simple applications of hash tables that are not
> >really critical to application performance, where the programmer
> >has gone through the rigmarole of resolution by chaining, for
> >additional code complexity and no real advantage.


> Rigmarole? Additional code complexity? I find a hash table with
> external chaining to be very simple to implement: it's a little
> demuxing code tacked in front of a linked-list implementation.


Well, yeah, as opposed to a little demuxing code.

> In contrast, with open addressing the implementation is much more
> involved: The text books say that I should compute another, different
> hash function for dealing with the collisions,


Double hashing is only one option.

> and I have to deal with the possibility of having more entries than
> slots.


As I said previously, I assume that you're going to have to resize
eventually, no matter what kind of hash table you use. If that
assumption doesn't hold, then you will get into a situation where
every slot is full, and then resolution by chaining is the only
possibility. Clearly in that situation you can't possibly use open
addressing.

Andrew.
Reply With Quote
  #10 (permalink)  
Old 11-26-2009, 03:35 PM
Anton Ertl
Guest
 
Posts: n/a
Default Re: hash tables

Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andrew29@littlepinkcloud.invalid> writes:

>
>> >Sometimes you see simple applications of hash tables that are not
>> >really critical to application performance, where the programmer
>> >has gone through the rigmarole of resolution by chaining, for
>> >additional code complexity and no real advantage.

>
>> Rigmarole? Additional code complexity? I find a hash table with
>> external chaining to be very simple to implement: it's a little
>> demuxing code tacked in front of a linked-list implementation.

>
>Well, yeah, as opposed to a little demuxing code.


Plus a little collision handling code (both on lookup and insert; only
on lookup with chaining); plus a little code for resizing.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/
Reply With Quote
  #11 (permalink)  
Old 11-26-2009, 04:18 PM
Bernd Paysan
Guest
 
Posts: n/a
Default Re: hash tables (was: symtab)

Anton Ertl wrote:
> In contrast, with open addressing the implementation is much more
> involved: The text books say that I should compute another, different
> hash function for dealing with the collisions, and I have to deal with
> the possibility of having more entries than slots. With chaining
> having a higher load factor results in degraded performance, with open
> addressing it results in failure; so I have to implement table
> overflow handling even in cases where a simple fixed-size table would
> be good enough with chaining.


Another note: Gforth's way to deal with multiple vocabularies in just one
shared hash table won't work with open addressing. And our method is much
more practical with the grow-your-hash-table-when-needed approach: Since we
have only one hash table, it's already large, and doesn't have to grow
often.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Reply With Quote
  #12 (permalink)  
Old 11-26-2009, 06:09 PM
Anton Ertl
Guest
 
Posts: n/a
Default Re: hash tables (was: symtab)

Bernd Paysan <bernd.paysan@gmx.de> writes:
>Another note: Gforth's way to deal with multiple vocabularies in just one
>shared hash table won't work with open addressing.


Why not? Because our method relies on separate buckets staying
separate, whereas in open addressing you jump to another bucket if the
first is full. One could work around that by storying the wordlist
number with each word, but that costs extra memory.

Another advantage of chaining in the context of Forth is that it
provides easy and natural shadowing of older definitions with the same
name, because it keeps this property of linked lists. However, IIRC
we don't really make use of that in Gforth, because we always rebuild
the hash table when executing a marker.

>And our method is much
>more practical with the grow-your-hash-table-when-needed approach: Since we
>have only one hash table, it's already large, and doesn't have to grow
>often.


But when it grows, that's much more expensive. Still, if we used
separate hash tables, when somebody does 10 wordlists, each with 100
words, 640 of these 1000 words will have gone through one doubling,
320 of these 640 through two doublings, etc.; in contrast, because we
supersized the hash table from the start (one doubling beyond what the
algorithm would do by itself), none of the words would have to go
through any doubling with our scheme.

Another advantage is that there is less meta-data overhead: no need to
store a pointer, table size and population size for every wordlist.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/
Reply With Quote
  #13 (permalink)  
Old 11-27-2009, 06:57 AM
John Passaniti
Guest
 
Posts: n/a
Default Re: symtab

On Nov 26, 2:19 am, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> Nobody here has shown me any evidence to indicate that my symtab is
> sub-optimal.


At best, the only thing you're able to claim with regard to optimality
is that at any particular moment in time, the nodes are ideally
arranged given some *past* cumulative frequency of access. That isn't
a very useful measurement. Nobody cares about the optimality of the
tree *after* some number of symbols have been accessed. What matters
are the basic statistical measures (sum, mean, minimum, maximum,
standard deviation) of the access times leading up to that point (and
the same statistics afterward). If your measure of optimality is how
accurately the tree's shape reflects the frequency distribution of
symbols, then congratulations-- you've apparently solved a problem
that nobody cares about.

You're also putting a tremendous amount of faith that the cumulative
frequency distribution at any point in time actually matters. What
seems more likely to me is that one cares far less about the overall
frequency distribution than the temporal pattern of access. Say I
have a symbol accessed more than any other during a run, but only near
the start of the run. Your scheme will promote that symbol to the
root, where it will uselessly remain there during the remainder of the
run. How is that helpful? How is that adaptive?

> If splay trees don't store a count the way that my symtab trees do,
> then they aren't going to be as efficient, because they have less
> information to work with.


The insight that you're missing is that searching the tree is itself
information. It's the identification of a "hot" path of pointers to a
desired node. No count is necessary to store or later compare.

You are using a singular definition of efficiency based on the tree's
shape accurately reflecting the frequency distribution of symbols.
Ignoring for the moment that I question the very premise of that being
useful in the kinds of applications one typically wants a symbol table
for, it completely ignores the cost of restructuring the tree. Let's
compare:

1) When some threshold is reached, a symtab decides to stop the world
and restructure the tree. This requires iterating over the *entire*
tree, juggling nodes around. As the tree grows, that is a
increasingly expensive operation in terms of processor time.

2) A splay tree optimizes only along the "hot" path of pointers to a
node when searching. The rest of the tree is unaffected and retains
the shape that past patterns of access left in those untouched parts
of the tree. The end result is that splay trees do less work than
your symtab and that work done is amortized over time. In some
variations of splay trees, the work done is even less because the
searched node isn't moved to the root, but just promoted closer to it.

> It seems unrealistic that any algorithm is
> going to minimize both space and time. More likely, there is a trade-
> off. These splay trees minimize space (no count field) at a cost in
> time, whereas my symtab trees minimize time (they are optimal) at a
> cost in space (a count field).


Again, you've optimized for something nobody cares about. What good
is having an optimal tree structure in memory when you've just
finished searching the tree and the program is ending? What good is
having an optimal tree structure based on frequency if the frequency
of symbols no longer represents the temporal pattern of access? Splay
trees make no claim (and thus, put forth no effort) in ensuring that
the tree at any point in time is globally optimal. Instead, splay
trees are concerned with a looser local, temporal optimality. And
they do so amortizing the cost over time, making access times more
evenly distributed than your "hold on while I stop the world and
restructure the whole tree" strategy.

> Also, I don't think that comparing integers takes up much time.


Not comparing integers takes up even less time. And less code. And
less complexity. And less opportunity for error. And no need for
code that tests to see if count exceeds a fixnum and then iterates
over the entire tree, dividing the count by two.

> I'll do some research on these splay-trees and get back to you with a
> comparison. I still feel confident that, whatever splay-trees are, my
> symtab will be more efficient. If splay trees don't have a count
> field, then they have only a vague guess of how many times each node
> has been accessed. How much information can be stored in zero bits?


The information is stored in the path of pointers leading to the node.

> I'm not an expert on information theory, but I think the answer would
> be: "None."


Yes, it's clear you aren't an expert on information theory. It's
probably because you have a rather limited definition of
"information."

> I don't consider the memory usage of the count field to be much of an
> issue. Even with an extra word per node, symtab is still going to be
> more memory efficient than hash tables with their big sparse array.


Most real-world implementations of hash tables start small and then
grow as necessary.

> Also, a symtab tree is going to be more robust because it grows
> dynamically. To resize a hash table you have to reinsert all of the
> elements.


False. Many hash table implementation do indeed reinsert all the
elements because that is simple. But it isn't necessary. Linear
hashing is a technique that allows for growing a hash table one node
at a time, and it only requires rehashing the elements in a single
collision chain. Other techniques like "consistent hashing" works by
partitioning the storage space, and only requires reinsertion of the
elements that fall in that space.

> That is very slow.


False. Hash tables are used in real-time situations where consistent
speed matters (and where your symtab would fail). This is done by
amortizing the cost of reinserting nodes of over time. One common
technique to resizing a hash table that has real-time constraints is
to allocate a new empty table, and then use it for primary searches,
falling back to the original table when searches fail. New insertions
occur in the new table, and nodes are incrementally moved from the
original to the new table, bound to an upper limit (such as number of
nodes or an interval of time). When finished, you delete the original
table.

> It is true that I have my OPTIMIZE
> function that traverses the tree and optimizes it, but OPTIMIZE is
> quite fast. Also, OPTIMIZE is only necessary when the programmer is
> using a large DAMPER value. If you set the DAMPER value low you get
> more churning, which slows down the operation of the tree, but you
> don't have to run OPTIMIZE as often (maybe not at all).


Quite fast? I'm sorry, I'm unfamiliar with that unit of measure.
Help me convert that to processor cycles per node so that I can
meaningfully decide if what you think is "quite fast" is suitable for
any particular application? Thanks.

Here's your homework. It has six parts. This really shouldn't take
much time:

1) Your data set at the end of your code is cute. Let's instead pick
something more useful and representative. You call this a symbol
table, so let's throw real-world symbols at it. Let's take the
symbols from source code. Factor is kind of cool, so let's use it.
When I look at the current distribution for Linux, I see there are
3177 files named "*.factor". I want you to iterate through those
files and put the symbols you find in those files in your symtab.
(I'll define a symbol as whitespace-separated tokens for ease of
parsing.) Ideally, you'll eliminate comments first, but maybe that
doesn't matter. Regardless, let's roughly estimate how many symbols
there are:

$ find -name "*.factor" | xargs cat | awk '{ for(i=1;i<NF;i++) print
$i;}' | wc -w
1228623

Cool, roughly 1.2M symbols. I wonder how many of those are unique?

$ find -name "*.factor" | xargs cat | awk '{for(i=1;i<=NF;i++) print
$i;}' | sort -u | wc -w
106714

So we're talking in the neighborhood of 100k unique symbols to store
in your symtab. That should tell us something.

2) Instrument your code so that you record the time taken to insert/
access symbols in the table. I don't know Factor, but it probably has
a library routine to look at the system's high-resolution timer (or
better, the performance counters in your CPU). Use the best timer you
have to measure the insert/access time. Collect simple statistics;
the sum, mean, minimum, and maximum.

3) Comment-out the timer additions from the previous step and replace
it with a count of the number of probes it took to find the symbol.
So if a symbol was found at the root, that's 1. If it was found down
one level, that's 2. And so on. Again, collect simple statistics.

4) Report your numbers. Include the value(s) you used for DAMPER.

Okay, that will be interesting. And now it's time for the *really*
fun part:

5) Change the code to replace and increment of "count" when inserted/
accessed to storing the current time (hopefully with millisecond
resolution, but okay if not). Keep the rest of the code the same. In
other words, we're going to completely ignore frequency and instead
reorder the tree based in terms of when a field was last used.

6) Repeat steps 2 through 4, above.

And here's my fearless prediction: Getting rid of the frequency
counting will reduce the average number of probes to find a symbol,
probably dramatically. That's because you want more of a cache and
less of a statistical model of your symbols. I don't know if this
will result in a reduction of the average access time to find symbols;
that time could potentially be dominated by the OPTIMIZE routine
stopping the world and restructuring nodes.

Have fun.
Reply With Quote
  #14 (permalink)  
Old 11-28-2009, 02:40 AM
Hugh Aguilar
Guest
 
Posts: n/a
Default Re: symtab

On Nov 27, 12:57 am, John Passaniti <john.passan...@gmail.com> wrote:
> You're also putting a tremendous amount of faith that the cumulative
> frequency distribution at any point in time actually matters. What
> seems more likely to me is that one cares far less about the overall
> frequency distribution than the temporal pattern of access. Say I
> have a symbol accessed more than any other during a run, but only near
> the start of the run. Your scheme will promote that symbol to the
> root, where it will uselessly remain there during the remainder of the
> run. How is that helpful? How is that adaptive?


My symtab tree restructures itself to be optimal for a program that is
getting recompiled repeatedly. There is some churning in the beginning
of the development of that program. Pretty soon though, the tree
settles down into an optimal state and there is very little churning,
which mostly occurs as the programmer writes new words and these start
out at a leaf and work their way up to an appropriate place in the
tree.

This "temporal pattern of access" doesn't make sense to me. You seem
to imply that the entire tree gets restructured during the course of
compilation of a program. That is a lot of churning!

> You are using a singular definition of efficiency based on the tree's
> shape accurately reflecting the frequency distribution of symbols.
> Ignoring for the moment that I question the very premise of that being
> useful in the kinds of applications one typically wants a symbol table
> for, it completely ignores the cost of restructuring the tree.


I'm not "completely ignoring the cost of restructuring the tree." As I
said above, the tree gradually settles down into an optimum state and
there is less and less churning needed.

> 1) When some threshold is reached, a symtab decides to stop the world
> and restructure the tree. This requires iterating over the *entire*
> tree, juggling nodes around. As the tree grows, that is a
> increasingly expensive operation in terms of processor time.


I have no idea what you are talking about here; there is no threshold
that triggers anything to happen. My symtab never "stops the world"
and restructures the entire tree. I do have my OPTIMIZE function that
restructures the entire tree, but its use is optional. It is possible
to set DAMPER to a value > 0 (the default is 30), in which case the
tree is sub-optimal. The higher the DAMPER value, the further it is
from optimality, and the lower the DAMPER value the closer the tree is
to optimality, with a DAMPER value of zero providing perfect
optimality. The use of OPTIMIZE is always optional though. I recommend
that if you have a high DAMPER value, you should call OPTIMIZE
occasionally. A good time to do this is at the end of the day when you
are shutting down the system. It will "stop the world" and restructure
the entire tree, but you have already gone home at this time.

I do have my PRUNE function that removes all of the smudged nodes from
the leaves. This is also optional. PRUNE is more important than
OPTIMIZE though, as it saves memory by getting rid of all of those
smudged nodes. It also speeds up the tree because searches through the
tree won't have to go through all of those smudged nodes. PRUNE also
halves all of the count values in the tree if they have gotten too
big. This is important. If you never call PRUNE, the count values will
eventually overflow the single-precision integers that they are stored
in. This may take several months to happen, but it will happen
eventually, so you do want to call PRUNE occasionally --- when you are
going home at the end of the day is a good time. Fortunately, PRUNE is
extremely fast because it doesn't move nodes around at all --- it just
traverses the tree.

Your comments above, about stopping the world to restructure the
entire tree, and about some kind of threshold, imply that you don't
understand how symtab works. Study the code --- you'll figure it out
eventually.

> 1) Your data set at the end of your code is cute. Let's instead pick
> something more useful and representative. You call this a symbol
> table, so let's throw real-world symbols at it. Let's take the
> symbols from source code. Factor is kind of cool, so let's use it.


Symtab normally initializes each node with a count value of 1. If you
have a-priori knowledge of what symbols are going to be used, you can
initialize each node with a count value that you provide. Assuming
that your a-priori knowledge is a good predictor of the future, this
will get the symtab tree into an optimal structure a lot sooner than
if every node starts out at 1. In either case though, the tree will
eventually become optimal --- it just gets there sooner with some help
up front.

I read in Forth Dimensions a long time ago that somebody did some
statistics over a large number of Forth programs to determine which
are the most commonly used. This was way back in the days of linked
lists, and the idea was to structure the linked list by order of
frequency, rather than by the order that the words happened to be
defined. If symtab were to be used for a Forth dictionary, collecting
statistics like this would be a good idea so that the dictionary would
be pretty close to optimal out-of-the-box, assuming that the user's
programs will be similar to the programs written by other Forth
programmers.

The programmer might be different than everybody else. He might, for
example, write a lot of programs involving floating-point arithmetic,
whereas the programs that we collected statistics on might not have
used floating-point very much. In this case, there will be some
churning in the early days of the guy's usage of the compiler as the
floating-point words burble upwards in the tree. Pretty soon though,
the tree will have adapted itself its new owner's programming style
and will be optimal for him (assuming that his programming style
doesn't change over time).
Reply With Quote
  #15 (permalink)  
Old 11-28-2009, 03:27 AM
Hugh Aguilar
Guest
 
Posts: n/a
Default Re: symtab

On Nov 24, 3:34*pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Hugh Aguilar wrote:
> > Consider the case in which each wordlist has its own hast
> > table.

>
> In Gforth, we use a single hash table for all wordlists. *Different
> wordlists have a different "seed" into the hash table, i.e. when DUP is
> in wordlist 0 (the first one), it would go to slot 316 (just an
> example), if it's in wordlist 3, it would go to slot 319, you get the
> basic idea (we use XOR). *The size of the hash table is the next power
> of two larger than the total number of entries. *That way, you won't
> have too many collisions, and not waste too much space. *If the number
> of entries reaches the critical value, the hash-table size is doubled
> and the table re-populated.


UR/Forth did the same thing in its single hash tree; it prepended the
voc# on the front of each word's name. It is a lot more efficient to
put each word-list in its own data structure however. In the case of
my symtab, this means that we avoid comparing our search string to a
large number of nodes that are in other word-lists and therefore
aren't going to match. In the case of a hash table, it means a lot
fewer collisions. UR/Forth pretty much had to use a single hash table
because the 16-bit 8086 uses segmented memory and so there was only
one realistic size for a hash table (64K).

That business of resizing and repopulating the hash tree is very slow.
Also, this happens at unpredictable times and can cause the system to
freeze up unexpectedly. All in all, I think that my symtab is much
more robust than any hash table.

> I've experimented with AVL tress (balanced binary trees), and the
> benchmark conclusion was that
>
> a) AVL trees are harder to implement


I agree that AVL trees are hard to implement, and my experience is
that symtab trees are even harder to implement. Fortunately, I've
already done the heavy lifting, so it should be fairly easy for
anybody to integrate my symtab code into his compiler.

> b) they are significantly slower than the hashes as implemented above
> for larger vocabularies (O(log(N)) instead of O(1))


I agree that AVL trees are slow, but this doesn't have anything to do
with symtab trees.
Reply With Quote
 
Reply

Popular Tags in the Forum
symtab

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Factor Hugh Aguilar Newsgroup comp.lang.forth 15 10-07-2009 09:31 PM



All times are GMT. The time now is 04:28 PM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.