|
|||
|
Here's a 22-line beauty for a classic and amazing algorithm:
http://bit.ly/bloom_filter The wiki article on the algorithm is brief and well-written: http://en.wikipedia.org/wiki/Bloom_filter It turns out that people in the 1970's were pretty smart :-) Raymond ------- follow my other python tips and recipes on twitter: @raymondh |
|
|
||||
|
||||
|
|
|
|||
|
On 04-05-11 20:17, Raymond Hettinger wrote:
> Here's a 22-line beauty for a classic and amazing algorithm: > http://bit.ly/bloom_filter > > The wiki article on the algorithm is brief and well-written: > http://en.wikipedia.org/wiki/Bloom_filter > > It turns out that people in the 1970's were pretty smart :-) > I think that often, the cleverness of people is inversely proportional to the amount of CPU power and RAM that they have in their computer. "Meh, what would I need such a thing for, I could just as well stick everything into a list" Thankfully there are also people for whom this doesn't count. (are they stuck in the '70s?) Also: wasn't there a talk on Pycon in which a bloom filter was mentioned? Irmen |
|
|||
|
> > It turns out that people in the 1970's were pretty smart :-)
> > I think that often, the cleverness of people is inversely proportional > to the amount of CPU power and RAM that they have in their computer. The Google guys have plenty of CPU power *and* plenty of cleverness :-) According to the wikipedia article, Google BigTable uses Bloom filters to reduce the disk lookups for non-existent rows or column. The Google Chrome web browser also uses Bloom filters to speed up its Safe Browsing service. > Also: wasn't there a talk on Pycon in which a bloom filter was mentioned? Yes! As a matter of fact there was: http://www.slideshare.net/c.titus.br...-bloom-filters Raymond ------- follow my other python tips and recipes on twitter: @raymondh |
|
|||
|
On 2011-05-04, Irmen de Jong <irmen@-NOSPAM-xs4all.nl> wrote:
> On 04-05-11 20:17, Raymond Hettinger wrote: >> Here's a 22-line beauty for a classic and amazing algorithm: >> http://bit.ly/bloom_filter >> >> The wiki article on the algorithm is brief and well-written: >> http://en.wikipedia.org/wiki/Bloom_filter >> >> It turns out that people in the 1970's were pretty smart :-) >> > > I think that often, the cleverness of people is inversely > proportional to the amount of CPU power and RAM that they have in > their computer. True. Unfortunately the difficulty in debugging and maintaining code is often directly proportional to the cleverness exhibited by the original programmer. -- Grant Edwards grant.b.edwards Yow! I'm also against at BODY-SURFING!! gmail.com |
|
|||
|
Raymond Hettinger <python@rcn.com> writes:
> Here's a 22-line beauty for a classic and amazing algorithm: > http://bit.ly/bloom_filter The use of pickle to serialize the keys is a little bit suspicious if there might be a reason to dump the filter to disk and re-use it in another run of the program. Pickle representation might change between Python releases, for example. It's just supposed to stay interoperable between versions, not necessarily bitwise-identical. Otherwise it's quite nice. I'd suggest adding a .update() operation that adds keys from a user-supplied iterator. |
|
|||
|
On 04-05-11 21:13, Raymond Hettinger wrote:
>>> It turns out that people in the 1970's were pretty smart :-) >> >> I think that often, the cleverness of people is inversely proportional >> to the amount of CPU power and RAM that they have in their computer. > > The Google guys have plenty of CPU power *and* plenty of > cleverness :-) Haha, true. We need to add a Googlyness factor in the equation. Or perhaps: think what they could achieve if they only had a few machines instead of thousands ;-) >> Also: wasn't there a talk on Pycon in which a bloom filter was mentioned? > > Yes! As a matter of fact there was: > http://www.slideshare.net/c.titus.br...-bloom-filters Thanks, that was the one. I didn't attend Pycon but I watched a truckload of talks on blip.tv and that one caught my attention because of its somewhat funny title 'handling ridiculous amounts of data with probabilistic data structures' I didn't understand all of it but the name Bloom Filter stuck, it seems. Adding it to my list of bookmarks of useful-stuff-I-intend-to-use-one-day-in-the-future... Irmen |
|
|||
|
On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
> Here's a 22-line beauty for a classic and amazing algorithm: > http://bit.ly/bloom_filter > > The wiki article on the algorithm is brief and well-written: > http://en.wikipedia.org/wiki/Bloom_filter As I understand the article, the array of num_bits should have num_probes (more or less independent) bits set for each key. But as I understand the code for i in range(self.num_probes): h, array_index = divmod(h, num_words) h, bit_index = divmod(h, 32) yield array_index, 1 << bit_index the same bit is being set or tested num_probes times. The last three lines have no dependence on i that I can see, so they appear to do the same thing each time. This seems like a bug. The article says "For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value;or add (or append) these values to the key." I do not see the code doing either of these. -- Terry Jan Reedy |
|
|||
|
On May 4, 12:42*pm, Terry Reedy <tjre...@udel.edu> wrote:
> On 5/4/2011 2:17 PM, Raymond Hettinger wrote: > > > Here's a 22-line beauty for a classic and amazing algorithm: > >http://bit.ly/bloom_filter > > > The wiki article on the algorithm is brief and well-written: > >http://en.wikipedia.org/wiki/Bloom_filter > > As I understand the article, the array of num_bits should have > num_probes (more or less independent) bits set for each key. But as I > understand the code > > * * * * *for i in range(self.num_probes): > * * * * * * *h, array_index = divmod(h, num_words) > * * * * * * *h, bit_index = divmod(h, 32) > * * * * * * *yield array_index, 1 << bit_index > > the same bit is being set or tested num_probes times. The last three > lines have no dependence on i that I can see, so they appear to do the > same thing each time. This seems like a bug. The 512 bits in h are progressively eaten-up between iterations. So each pass yields a different (array index, bit_mask) pair. It's easy to use the interactive prompt to show that different probes are produced on each pass: >>> bf = BloomFilter(num_bits=1000, num_probes=8) >>> pprint(list(bf.get_probes('Alabama'))) [(19, 1073741824), (11, 64), (9, 134217728), (25, 1024), (24, 33554432), (6, 16), (7, 16777216), (22, 1048576)] The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of a cryptographic hash ![]() The fifty state example in the recipe is a reasonable demonstration that the recipe works as advertised. It successfully finds all fifty states (the true positives) and it tries 100,000 negatives resulting in only a handful of false negatives. That should be somewhat convincing that it all works. Raymond ------- follow my other python tips and recipes on twitter: @raymondh |
|
|||
|
On May 4, 12:27*pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Raymond Hettinger <pyt...@rcn.com> writes: > > Here's a 22-line beauty for a classic and amazing algorithm: > >http://bit.ly/bloom_filter > > The use of pickle to serialize the keys is a little bit suspicious if > there might be a reason to dump the filter to disk and re-use it in > another run of the program. *Pickle representation might change between > Python releases, for example. *It's just supposed to stay interoperable > between versions, not necessarily bitwise-identical. > > Otherwise it's quite nice. *I'd suggest adding a .update() operation > that adds keys from a user-supplied iterator. I chose pickle because it solved the problem of turning arbitrary objects into bytes which are needed as inputs to sha512. It seems that this particular choice of hash function is distracting some readers away from the interesting part of how a Bloom filter works. Since the choice of hash functions is completely arbitrary, I'm thinking of substituting something a little more basic. What do you think of this alternative as a way to make it clearer that each successive probe is uncorrelated, yet fully dependent on the key? def get_probes(self, key): hasher = Random(key).randrange num_words = len(self.arr) for _ in range(self.num_probes): array_index = hasher(num_words) bit_index = hasher(32) yield array_index, 1 << bit_index Raymond ------- follow my other python tips and recipes on twitter: @raymondh |
|
|||
|
Grant Edwards <invalid@invalid.invalid> writes:
> On 2011-05-04, Irmen de Jong <irmen@-NOSPAM-xs4all.nl> wrote: > > I think that often, the cleverness of people is inversely > > proportional to the amount of CPU power and RAM that they have in > > their computer. > > True. > > Unfortunately the difficulty in debugging and maintaining code is > often directly proportional to the cleverness exhibited by the > original programmer. +1 QOTW -- \ “Without cultural sanction, most or all of our religious | `\ beliefs and rituals would fall into the domain of mental | _o__) disturbance.” —John F. Schumaker | Ben Finney |
|
|||
|
On 5/4/2011 5:39 PM, Raymond Hettinger wrote:
> The 512 bits in h are progressively eaten-up between iterations. So > each pass yields a different (array index, bit_mask) pair. Yeh, obvious now that I see it. > It's easy to use the interactive prompt to show that different probes > are produced on each pass: > >>>> bf = BloomFilter(num_bits=1000, num_probes=8) >>>> pprint(list(bf.get_probes('Alabama'))) > [(19, 1073741824), > (11, 64), > (9, 134217728), > (25, 1024), > (24, 33554432), > (6, 16), > (7, 16777216), > (22, 1048576)] Should have tried that. > The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of > a cryptographic hash ![]() > The fifty state example in the recipe is a reasonable demonstration > that the recipe works as advertised. It successfully finds all fifty > states (the true positives) and it tries 100,000 negatives resulting > in only a handful of false negatives. I presume you mean 'false positives', as in the program comment and Wikipedia. The test would be more convincing to many with 100000 other geographic names (hard to come by, I know), or other english names or words or even with longer random strings that matched the lengths of the state names. But an average of 5/100000 false positives in 5 runs is good. -- Terry Jan Reedy |
|
|||
|
On May 4, 5:26*pm, Terry Reedy <tjre...@udel.edu> wrote:
> The test would be more convincing to many with 100000 other geographic > names (hard to come by, I know), or other english names or words or even > with longer random strings that matched the lengths of the state names. > But an average of 5/100000 false positives in 5 runs is good. I've just posted an update with a spell checker using a large dictionary. That should make it easy to validate against some real world text samples. Raymond |
|
|||
|
On Thu, May 5, 2011 at 5:02 AM, Irmen de Jong <irmen@-nospam-xs4all.nl> wrote:
> I think that often, the cleverness of people is inversely proportional to > the amount of CPU power and RAM that they have in their computer. As Mark Rosewater is fond of saying, restrictions breed creativity. Lack of computational resources is a major restriction (for an extreme example of RAM shortage, look at how much code you can fit into a boot sector without loading anything more from the disk). Take away all the restrictions, and people will tend to code sloppily. Chris Angelico |
|
|||
|
On May 4, 2:17*pm, Raymond Hettinger <pyt...@rcn.com> wrote:
> Here's a 22-line beauty for a classic and amazing algorithm:http://bit.ly/bloom_filter > > The wiki article on the algorithm is brief and well-written:http://en.wikipedia.org/wiki/Bloom_filter > > It turns out that people in the 1970's were pretty smart :-) > > Raymond > > ------- > follow my other python tips and recipes on twitter: *@raymondh Very cool! Thanks. |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|