|
|||
|
This is slightly off topic, but I'm hoping folks can point me in the
right direction. I'm looking for a fairly lightweight key/value store that works for this type of problem: ideally plays nice with the Python ecosystem the data set is static, and written infrequently enough that I definitely want *read* performance to trump all there is too much data to keep it all in memory (so no memcache) users will access keys with fairly uniform, random probability the key/value pairs are fairly homogenous in nature: keys are <= 16 chars values are between 1k and 4k bytes generally approx 3 million key/value pairs total amount of data == 6Gb needs to work on relatively recent versions of FreeBSD and Linux My current solution works like this: keys are file paths directories are 2 levels deep (30 dirs w/100k files each) values are file contents The current solution isn't horrible, but I'm try to squeeze a little performance/robustness out of it. A minor nuisance is that I waste a fair amount of disk space, since the values are generally less than 4k in size. A larger concern is that I'm not convinced that file systems are optimized for dealing with lots of little files in a shallow directory structure. To deal with the latter issue, a minor refinement would be to deepen the directory structure, but I want to do due diligence on other options first. I'm looking for something a little lighter than a full-on database (either SQL or no-SQL), although I'm not completely ruling out any alternatives yet. As I mention up top, I'm mostly hoping folks can point me toward sources they trust, whether it be other mailing lists, good tools, etc. To the extent that this is on topic and folks don't mind discussing this here, I'm happy to follow up on any questions. Thanks, Steve P.S. I've already found some good information via Google, but there's a lot of noise out there. |
|
|
||||
|
||||
|
|
|
|||
|
Steve Howell <showell30@yahoo.com> writes:
> keys are file paths > directories are 2 levels deep (30 dirs w/100k files each) > values are file contents > The current solution isn't horrible, Yes it is ;-) > As I mention up top, I'm mostly hoping folks can point me toward > sources they trust, whether it be other mailing lists, good tools, cdb sounds reasonable for your purposes. I'm sure there are python bindings for it. http://cr.yp.to/cdb.html mentions a 4gb limit (2**32) but I half-remember something about a 64 bit version. |
|
|||
|
On May 2, 7:46*pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes: > > * keys are file paths > > * directories are 2 levels deep (30 dirs w/100k files each) > > * values are file contents > > The current solution isn't horrible, > > Yes it is ;-) > > As I mention up top, I'm mostly hoping folks can point me toward > > sources they trust, whether it be other mailing lists, good tools, > > cdb sounds reasonable for your purposes. *I'm sure there are python > bindings for it. > > http://cr.yp.to/cdb.htmlmentions a 4gb limit (2**32) but I > half-remember something about a 64 bit version. Thanks. That's definitely in the spirit of what I'm looking for, although the non-64 bit version is obviously geared toward a slightly smaller data set. My reading of cdb is that it has essentially 64k hash buckets, so for 3 million keys, you're still scanning through an average of 45 records per read, which is about 90k of data for my record size. That seems actually inferior to a btree-based file system, unless I'm missing something. I did find this as follow up to your lead: http://thomas.mangin.com/data/source/cdb.py Unfortunately, it looks like you have to first build the whole thing in memory. |
|
|||
|
Steve Howell <showell30@yahoo.com> writes:
> Thanks. That's definitely in the spirit of what I'm looking for, > although the non-64 bit version is obviously geared toward a slightly > smaller data set. My reading of cdb is that it has essentially 64k > hash buckets, so for 3 million keys, you're still scanning through an > average of 45 records per read, which is about 90k of data for my > record size. That seems actually inferior to a btree-based file > system, unless I'm missing something. 1) presumably you can use more buckets in a 64 bit version; 2) scanning 90k probably still takes far less time than a disk seek, even a "seek" (several microseconds in practice) with a solid state disk. > http://thomas.mangin.com/data/source/cdb.py > Unfortunately, it looks like you have to first build the whole thing > in memory. It's probably fixable, but I'd guess you could just use Bernstein's cdbdump program instead. Alternatively maybe you could use one of the *dbm libraries, which burn a little more disk space, but support online update. |
|
|||
|
On May 2, 8:29*pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes: > > Thanks. *That's definitely in the spirit of what I'm looking for, > > although the non-64 bit version is obviously geared toward a slightly > > smaller data set. *My reading of cdb is that it has essentially 64k > > hash buckets, so for 3 million keys, you're still scanning through an > > average of 45 records per read, which is about 90k of data for my > > record size. *That seems actually inferior to a btree-based file > > system, unless I'm missing something. > > 1) presumably you can use more buckets in a 64 bit version; 2) scanning > 90k probably still takes far less time than a disk seek, even a "seek" > (several microseconds in practice) with a solid state disk. > Doesn't cdb do at least one disk seek as well? In the diagram on this page, it seems you would need to do a seek based on the value of the initial pointer (from the 256 possible values): http://cr.yp.to/cdb/cdb.txt > >http://thomas.mangin.com/data/source/cdb.py > > Unfortunately, it looks like you have to first build the whole thing > > in memory. > > It's probably fixable, but I'd guess you could just use Bernstein's > cdbdump program instead. > > Alternatively maybe you could use one of the *dbm libraries, > which burn a little more disk space, but support online update. Yup, I don't think I want to incur the extra overhead. Do you have any first hand experience pushing dbm to the scale of 6Gb or so? My take on dbm is that its niche is more in the 10,000-record range. |
|
|||
|
Steve Howell <showell30@yahoo.com> writes:
> Doesn't cdb do at least one disk seek as well? In the diagram on this > page, it seems you would need to do a seek based on the value of the > initial pointer (from the 256 possible values): Yes, of course it has to seek if there is too much data to fit in memory. All I'm saying is that if you're spending milliseconds on the seek, that may dominate the lookup time even if you scan the 90k. Actually, looking at the spec more closely, there are 256 hash tables in the file, but each table can be of any size. So there can be far more than 2**16 hash slots. Uncharacteristically for Bernstein, the document is pretty unclear, so maybe you have to look at the code to be sure of the layout. Note also that empty hash buckets have just 2 words of overhead. So with 3M keys and 75% load factor, you get 4M buckets and relatively few collisions. The extra 1M buckets in a 64 bit implementation is just 16MB in the file, which isn't much at all even considering that you want it to stay resident in memory to avoid some seeks, assuming you're on a PC and not some smaller device like a phone. (A phone will have a solid state disk eliminating most seek time, so you're ok in that situation too). > Yup, I don't think I want to incur the extra overhead. Do you have > any first hand experience pushing dbm to the scale of 6Gb or so? My > take on dbm is that its niche is more in the 10,000-record range. There are a bunch of different variants. I'm trying to remember what I've personally done with it and I'm sure I've used much more than 10k records, but maybe not millions. Unix dbm was originally designed to handle millions of records back when that was a lot. I'd expect gdbm, bsddb and so forth can handle it easily. The naive Python dbm module might not be able to. The amount of data you're talking about (a few million keys, a few gb of data) is fairly modest by today's standards, so I would think fancy methods aren't really needed. |
|
|||
|
Paul Rubin <no.email@nospam.invalid> writes:
>looking at the spec more closely, there are 256 hash tables.. ... You know, there is a much simpler way to do this, if you can afford to use a few hundred MB of memory and you don't mind some load time when the program first starts. Just dump all the data sequentially into a file. Then scan through the file, building up a Python dictionary mapping data keys to byte offsets in the file (this is a few hundred MB if you have 3M keys). Then dump the dictionary as a Python pickle and read it back in when you start the program. You may want to turn off the cyclic garbage collector when building or loading the dictionary, as it badly can slow down the construction of big lists and maybe dicts (I'm not sure of the latter). |
|
|||
|
On May 2, 11:48*pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Paul Rubin <no.em...@nospam.invalid> writes: > >looking at the spec more closely, there are 256 hash tables.. ... > > You know, there is a much simpler way to do this, if you can afford to > use a few hundred MB of memory and you don't mind some load time when > the program first starts. *Just dump all the data sequentially into a > file. *Then scan through the file, building up a Python dictionary > mapping data keys to byte offsets in the file (this is a few hundred MB > if you have 3M keys). *Then dump the dictionary as a Python pickle and > read it back in when you start the program. > > You may want to turn off the cyclic garbage collector when building or > loading the dictionary, as it badly can slow down the construction of > big lists and maybe dicts (I'm not sure of the latter). I'm starting to lean toward the file-offset/seek approach. I am writing some benchmarks on it, comparing it to a more file-system based approach like I mentioned in my original post. I'll report back when I get results, but it's already way past my bedtime for tonight. Thanks for all your help and suggestions. |
|
|||
|
On 5/3/2012 10:42, Steve Howell wrote:
> On May 2, 11:48 pm, Paul Rubin<no.em...@nospam.invalid> wrote: >> Paul Rubin<no.em...@nospam.invalid> writes: >>> looking at the spec more closely, there are 256 hash tables.. ... >> >> You know, there is a much simpler way to do this, if you can afford to >> use a few hundred MB of memory and you don't mind some load time when >> the program first starts. Just dump all the data sequentially into a >> file. Then scan through the file, building up a Python dictionary >> mapping data keys to byte offsets in the file (this is a few hundred MB >> if you have 3M keys). Then dump the dictionary as a Python pickle and >> read it back in when you start the program. >> >> You may want to turn off the cyclic garbage collector when building or >> loading the dictionary, as it badly can slow down the construction of >> big lists and maybe dicts (I'm not sure of the latter). > > I'm starting to lean toward the file-offset/seek approach. I am > writing some benchmarks on it, comparing it to a more file-system > based approach like I mentioned in my original post. I'll report back > when I get results, but it's already way past my bedtime for tonight. > > Thanks for all your help and suggestions. You should really cache the accesses to that file hoping that the accesses are not as random as you think. If that's the case you should notice a *huge* improvement. Kiuhnm |
|
|||
|
> I'm looking for a fairly lightweight key/value store that works for
> this type of problem: I'd start with a benchmark and try some of the things that are already in the standard library: - bsddb - sqlite3 (table of key, value, index key) - shelve (though I doubt this one) You might find that for a little effort you get enough out of one of these. Another module which is not in the standard library is hdf5/PyTables and in my experience very fast. HTH, -- Miki |
|
|||
|
On May 3, 1:42*am, Steve Howell <showel...@yahoo.com> wrote:
> On May 2, 11:48*pm, Paul Rubin <no.em...@nospam.invalid> wrote: > > > Paul Rubin <no.em...@nospam.invalid> writes: > > >looking at the spec more closely, there are 256 hash tables.. ... > > > You know, there is a much simpler way to do this, if you can afford to > > use a few hundred MB of memory and you don't mind some load time when > > the program first starts. *Just dump all the data sequentially into a > > file. *Then scan through the file, building up a Python dictionary > > mapping data keys to byte offsets in the file (this is a few hundred MB > > if you have 3M keys). *Then dump the dictionary as a Python pickle and > > read it back in when you start the program. > > > You may want to turn off the cyclic garbage collector when building or > > loading the dictionary, as it badly can slow down the construction of > > big lists and maybe dicts (I'm not sure of the latter). > > I'm starting to lean toward the file-offset/seek approach. *I am > writing some benchmarks on it, comparing it to a more file-system > based approach like I mentioned in my original post. *I'll report back > when I get results, but it's already way past my bedtime for tonight. > > Thanks for all your help and suggestions. I ended up going with the approach that Paul suggested (except I used JSON instead of pickle for persisting the hash). I like it for its simplicity and ease of troubleshooting. My test was to write roughly 4GB of data, with 2 million keys of 2k bytes each. The nicest thing was how quickly I was able to write the file. Writing tons of small files bogs down the file system, whereas the one- big-file approach finishes in under three minutes. Here's the code I used for testing: https://github.com/showell/KeyValue/...t_key_value.py Here are the results: ~/WORKSPACE/KeyValue > ls -l values.txt hash.txt -rw-r--r-- 1 steve staff 44334161 May 3 18:53 hash.txt -rw-r--r-- 1 steve staff 4006000000 May 3 18:53 values.txt 2000000 out of 2000000 records yielded (2k each) Begin READING test num trials 100000 time spent 39.8048191071 avg delay 0.000398048191071 real 2m46.887s user 1m35.232s sys 0m19.723s |
|
|||
|
Steve Howell <showell30@yahoo.com> writes:
> My test was to write roughly 4GB of data, with 2 million keys of 2k > bytes each. If the records are something like english text, you can compress them with zlib and get some compression gain by pre-initializing a zlib dictionary from a fixed english corpus, then cloning it. That is, if your messages are a couple paragraphs, you might say something like: iv = (some fixed 20k or so of records concatenated together) compressor = zlib(iv).clone() # I forget what this # operation is actually called # I forget what this is called too, but the idea is you throw # away the output of compressing the fixed text, and sync # to a byte boundary compressor.sync() zout = compressor.compress(your_record).sync() ... i.e. the part you save in the file is just the difference between compress(corpus) and compress(corpus_record). To decompress, you initialize a compressor the same way, etc. It's been a while since I used that trick but for json records of a few hundred bytes, I remember getting around 2:1 compression, while starting with an unprepared compressor gave almost no compression. |
|
|||
|
On May 3, 9:38*pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes: > > My test was to write roughly 4GB of data, with 2 million keys of 2k > > bytes each. > > If the records are something like english text, you can compress > them with zlib and get some compression gain by pre-initializing > a zlib dictionary from a fixed english corpus, then cloning it. > That is, if your messages are a couple paragraphs, you might > say something like: > > * iv = (some fixed 20k or so of records concatenated together) > * compressor = zlib(iv).clone() *# I forget what this > * * * * * * * * * * * * * * * * *# operation is actually called > > * # I forget what this is called too, but the idea is you throw > * # away the output of compressing the fixed text, and sync > * # to a byte boundary > * compressor.sync() > > * zout = compressor.compress(your_record).sync() > * ... > > i.e. the part you save in the file is just the difference > between compress(corpus) and compress(corpus_record). *To > decompress, you initialize a compressor the same way, etc. > > It's been a while since I used that trick but for json records of a few > hundred bytes, I remember getting around 2:1 compression, while starting > with an unprepared compressor gave almost no compression. Sounds like a useful technique. The text snippets that I'm compressing are indeed mostly English words, and 7-bit ascii, so it would be practical to use a compression library that just uses the same good-enough encodings every time, so that you don't have to write the encoding dictionary as part of every small payload. Sort of as you suggest, you could build a Huffman encoding for a representative run of data, save that tree off somewhere, and then use it for all your future encoding/decoding. Is there a name to describe this technique? |
|
|||
|
Steve Howell <showell30@yahoo.com> writes:
> Sounds like a useful technique. The text snippets that I'm > compressing are indeed mostly English words, and 7-bit ascii, so it > would be practical to use a compression library that just uses the > same good-enough encodings every time, so that you don't have to write > the encoding dictionary as part of every small payload. Zlib stays adaptive, the idea is just to start with some ready-made compression state that reflects the statistics of your data. > Sort of as you suggest, you could build a Huffman encoding for a > representative run of data, save that tree off somewhere, and then use > it for all your future encoding/decoding. Zlib is better than Huffman in my experience, and Python's zlib module already has the right entry points. Looking at the docs, Compress.flush(Z_SYNC_FLUSH) is the important one. I did something like this before and it was around 20 lines of code. I don't have it around any more but maybe I can write something else like it sometime. > Is there a name to describe this technique? Incremental compression maybe? |
|
|||
|
On May 3, 11:03*pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes: > > Sounds like a useful technique. *The text snippets that I'm > > compressing are indeed mostly English words, and 7-bit ascii, so it > > would be practical to use a compression library that just uses the > > same good-enough encodings every time, so that you don't have to write > > the encoding dictionary as part of every small payload. > > Zlib stays adaptive, the idea is just to start with some ready-made > compression state that reflects the statistics of your data. > > > Sort of as you suggest, you could build a Huffman encoding for a > > representative run of data, save that tree off somewhere, and then use > > it for all your future encoding/decoding. > > Zlib is better than Huffman in my experience, and Python's zlib module > already has the right entry points. *Looking at the docs, > Compress.flush(Z_SYNC_FLUSH) is the important one. *I did something like > this before and it was around 20 lines of code. *I don't have it around > any more but maybe I can write something else like it sometime. > > > Is there a name to describe this technique? > > Incremental compression maybe? Many thanks, this is getting me on the right path: compressor = zlib.compressobj() s = compressor.compress("foobar") s += compressor.flush(zlib.Z_SYNC_FLUSH) s_start = s compressor2 = compressor.copy() s += compressor.compress("baz") s += compressor.flush(zlib.Z_FINISH) print zlib.decompress(s) s = s_start s += compressor2.compress("spam") s += compressor2.flush(zlib.Z_FINISH) print zlib.decompress(s) |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|