Steve Howell <showell30@yahoo.com> writes:
>     compressor = zlib.compressobj()
>     s = compressor.compress("foobar")
>     s += compressor.flush(zlib.Z_SYNC_FLUSH)
>
>     s_start = s
>     compressor2 = compressor.copy()

I think you also want to make a decompressor here, and initialize it
with s and then clone it. Then you don't have to reinitialize every
time you want to decompress something.

I also seem to remember that the first few bytes of compressed output
are always some fixed string or checksum, that you can strip out after
compression and put back before decompression, giving further savings in
output size when you have millions of records.

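The idea in this post (prime a compressor and a decompressor once, then clone them per record) can be sketched roughly as follows. This is a minimal Python 3 sketch, so it uses bytes rather than the str in the quoted snippet. The PREFIX text and the names compress_record and decompress_record are assumptions made up for illustration; they are not taken from the linked repository.

```python
import zlib

# Hypothetical priming text ("salt"); in practice this would resemble
# the records being stored so that they can reference it.
PREFIX = b"foobar " * 20

# Prime a compressor with the prefix and sync-flush it, so that all
# output for the prefix has been emitted and is byte-aligned.
_base_compressor = zlib.compressobj()
_prefix_bytes = _base_compressor.compress(PREFIX)
_prefix_bytes += _base_compressor.flush(zlib.Z_SYNC_FLUSH)

# Prime a decompressor with the same compressed prefix, once.
_base_decompressor = zlib.decompressobj()
_base_decompressor.decompress(_prefix_bytes)


def compress_record(data: bytes) -> bytes:
    """Return only the incremental bytes produced after the prefix."""
    c = _base_compressor.copy()  # clone, so the primed state is reused
    return c.compress(data) + c.flush(zlib.Z_SYNC_FLUSH)


def decompress_record(blob: bytes) -> bytes:
    """Decode incremental bytes using a clone of the primed decompressor."""
    d = _base_decompressor.copy()
    return d.decompress(blob)
```

Because every call works on a fresh clone, records can be decompressed in any order, and neither base object is ever modified after priming.
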
On May 3, 11:59 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> >     compressor = zlib.compressobj()
> >     s = compressor.compress("foobar")
> >     s += compressor.flush(zlib.Z_SYNC_FLUSH)
> >
> >     s_start = s
> >     compressor2 = compressor.copy()
>
> I think you also want to make a decompressor here, and initialize it
> with s and then clone it. Then you don't have to reinitialize every
> time you want to decompress something.

Makes sense. I believe I got that part correct:

https://github.com/showell/KeyValue/..._compressor.py

> I also seem to remember that the first few bytes of compressed output
> are always some fixed string or checksum, that you can strip out after
> compression and put back before decompression, giving further savings in
> output size when you have millions of records.

I'm pretty sure this happens for free as long as the salt is large
enough, but maybe I'm misunderstanding.

Steve Howell <showell30@yahoo.com> writes:
> Makes sense. I believe I got that part correct:
>
> https://github.com/showell/KeyValue/..._compressor.py

The API looks nice, but your compress method makes no sense. Why do you
include s.prefix in s and then strip it off? Why do you save the prefix
and salt in the instance, and have self.salt2 and s[len(self.salt):]
in the decompress? You should be able to just get the incremental bit.

> I'm pretty sure this happens for free as long as the salt is large
> enough, but maybe I'm misunderstanding.

No I mean there is some fixed overhead (a few bytes) in the compressor
output, to identify it as such. That's fine when the input and output
are both large, but when there's a huge number of small compressed
strings, it adds up.

On May 4, 1:01 am, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > Makes sense. I believe I got that part correct:
> >
> > https://github.com/showell/KeyValue/..._compressor.py
>
> The API looks nice, but your compress method makes no sense. Why do you
> include s.prefix in s and then strip it off? Why do you save the prefix
> and salt in the instance, and have self.salt2 and s[len(self.salt):]
> in the decompress? You should be able to just get the incremental bit.

This is fixed now.

https://github.com/showell/KeyValue/..._compressor.py

> > I'm pretty sure this happens for free as long as the salt is large
> > enough, but maybe I'm misunderstanding.
>
> No I mean there is some fixed overhead (a few bytes) in the compressor
> output, to identify it as such. That's fine when the input and output
> are both large, but when there's a huge number of small compressed
> strings, it adds up.

If it's in the header, wouldn't it be part of the output that comes
before Z_SYNC_FLUSH?

Steve Howell <showell30@yahoo.com> writes:
>> You should be able to just get the incremental bit.
> This is fixed now.

Nice.

> If it's in the header, wouldn't it be part of the output that comes
> before Z_SYNC_FLUSH?

Hmm, maybe you are right. My version was several years ago and I don't
remember it well, but I half-remember spending some time diddling around
with this issue.

On May 3, 6:10 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> > I'm looking for a fairly lightweight key/value store that works for
> > this type of problem:
>
> I'd start with a benchmark and try some of the things that are already in the standard library:
> - bsddb
> - sqlite3 (table of key, value, index key)
> - shelve (though I doubt this one)
>

Thanks. I think I'm ruling out bsddb, since it's recently deprecated:

http://www.gossamer-threads.com/list.../python/106494

I'll give sqlite3 a spin. Has anybody out there wrapped sqlite3
behind a hash interface already? I know it's simple to do
conceptually, but there are some minor details to work out for large
amounts of data (like creating the index after all the inserts), so if
somebody's already tackled this, it would be useful to see their
code.

> You might find that for a little effort you get enough out of one of these.
>
> Another module which is not in the standard library is hdf5/PyTables and in my experience very fast.

Thanks.

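For the question about wrapping sqlite3 behind a hash interface, a rough sketch might look like the following. The class name SqliteStore, the table name kv, and the index name kv_key are hypothetical names chosen for illustration. The bulk load inserts all rows first and only then creates the index, as the post describes.

```python
import sqlite3


class SqliteStore:
    """Hypothetical read-mostly, dict-like wrapper around one sqlite3 table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT, value BLOB)"
        )

    def bulk_load(self, items):
        """Insert (key, value) pairs, then build the index once at the end."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO kv (key, value) VALUES (?, ?)", items
            )
            self.conn.execute(
                "CREATE UNIQUE INDEX IF NOT EXISTS kv_key ON kv (key)"
            )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __contains__(self, key):
        try:
            self[key]
        except KeyError:
            return False
        return True


store = SqliteStore()
store.bulk_load([("a", b"first value"), ("b", b"second value")])
print(store["a"])
```

Whether deferring the index actually pays off is something to benchmark on the real data set, as suggested in the quoted post.
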
On Friday, 4 May 2012 16:27:54 UTC+1, Steve Howell wrote:
> On May 3, 6:10 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> > > I'm looking for a fairly lightweight key/value store that works for
> > > this type of problem:
> >
> > I'd start with a benchmark and try some of the things that are already in the standard library:
> > - bsddb
> > - sqlite3 (table of key, value, index key)
> > - shelve (though I doubt this one)
> >
>
> Thanks. I think I'm ruling out bsddb, since it's recently deprecated:
>
> http://www.gossamer-threads.com/list.../python/106494
>
> I'll give sqlite3 a spin. Has anybody out there wrapped sqlite3
> behind a hash interface already? I know it's simple to do
> conceptually, but there are some minor details to work out for large
> amounts of data (like creating the index after all the inserts), so if
> somebody's already tackled this, it would be useful to see their
> code.
>
> > You might find that for a little effort you get enough out of one of these.
> >
> > Another module which is not in the standard library is hdf5/PyTables and in my experience very fast.
>
> Thanks.

Could also look at Tokyo Cabinet or Kyoto Cabinet (but I believe that
has slightly different licensing conditions for commercial use).

On 5/4/2012 12:14 AM, Steve Howell wrote:
> On May 3, 11:59 pm, Paul Rubin<no.em...@nospam.invalid> wrote:
>> Steve Howell<showel...@yahoo.com> writes:
>>>     compressor = zlib.compressobj()
>>>     s = compressor.compress("foobar")
>>>     s += compressor.flush(zlib.Z_SYNC_FLUSH)
>>
>>>     s_start = s
>>>     compressor2 = compressor.copy()

    That's awful. There's no point in compressing six characters
with zlib. Zlib has a minimum overhead of 11 bytes. You just
made the data bigger.

                                John Nagle

John Nagle <nagle@animats.com> writes:
> That's awful. There's no point in compressing six characters
> with zlib. Zlib has a minimum overhead of 11 bytes. You just
> made the data bigger.

This hack is about avoiding the initialization overhead--do you really
get 11 bytes after every SYNC_FLUSH? I do remember with that SYNC_FLUSH
trick, that I got good compression with a few hundred bytes of input
(the records I was dealing with at the time).

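One way to settle the overhead question is simply to measure it. The sketch below compares the size of a standalone zlib.compress() result with the incremental bytes produced by a primed, cloned compressor after a Z_SYNC_FLUSH. It prints the sizes rather than asserting any particular numbers. The PREFIX text and the helper names incremental and restore are assumptions for illustration. The one fixed cost it relies on is that a sync flush ends its output with an empty stored block, whose last four bytes are 00 00 FF FF; those bytes can be dropped for storage and appended again before decompression.

```python
import zlib

PREFIX = b"key=value; " * 50        # hypothetical priming text
SYNC_TRAILER = b"\x00\x00\xff\xff"  # tail of the empty stored block

# Prime one compressor and one decompressor with the prefix.
base_c = zlib.compressobj()
primed = base_c.compress(PREFIX) + base_c.flush(zlib.Z_SYNC_FLUSH)
base_d = zlib.decompressobj()
base_d.decompress(primed)


def incremental(record: bytes) -> bytes:
    """Bytes emitted for one record by a clone of the primed compressor."""
    c = base_c.copy()
    return c.compress(record) + c.flush(zlib.Z_SYNC_FLUSH)


def restore(stripped: bytes) -> bytes:
    """Re-append the sync trailer and decode with a primed clone."""
    d = base_d.copy()
    return d.decompress(stripped + SYNC_TRAILER)


for record in (b"foobar", b"key=value; key=other; "):
    standalone = zlib.compress(record)
    inc = incremental(record)
    stripped = inc[:-len(SYNC_TRAILER)]
    # Columns: input size, standalone size, incremental size, stripped size
    print(len(record), len(standalone), len(inc), len(stripped))
```
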
On May 6, 10:21 pm, John Nagle <na...@animats.com> wrote:
> On 5/4/2012 12:14 AM, Steve Howell wrote:
>
> > On May 3, 11:59 pm, Paul Rubin<no.em...@nospam.invalid> wrote:
> >> Steve Howell<showel...@yahoo.com> writes:
> >>>     compressor = zlib.compressobj()
> >>>     s = compressor.compress("foobar")
> >>>     s += compressor.flush(zlib.Z_SYNC_FLUSH)
>
> >>>     s_start = s
> >>>     compressor2 = compressor.copy()
>
>     That's awful. There's no point in compressing six characters
> with zlib. Zlib has a minimum overhead of 11 bytes. You just
> made the data bigger.
>

The actual strings that I'm compressing are much longer than six
characters. Obviously, "foobar" was just for example purposes.