  #16 (permalink)  
05-04-2012, 06:59 AM
Paul Rubin
Re: key/value store optimized for disk storage

Steve Howell <showell30@yahoo.com> writes:
> compressor = zlib.compressobj()
> s = compressor.compress("foobar")
> s += compressor.flush(zlib.Z_SYNC_FLUSH)
>
> s_start = s
> compressor2 = compressor.copy()


I think you also want to make a decompressor here, and initialize it
with s and then clone it. Then you don't have to reinitialize every
time you want to decompress something.

I also seem to remember that the first few bytes of compressed output
are always some fixed string or checksum, that you can strip out after
compression and put back before decompression, giving further savings in
output size when you have millions of records.
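
For readers following along, the clone idea above can be sketched roughly as follows. This is only an illustration, assuming Python 3 (so inputs are bytes rather than the str used in the quoted snippet). PREFIX, compress_record and decompress_record are made-up names and are not taken from the code linked later in the thread.

```python
import zlib

# Hypothetical shared prefix; in practice it would resemble typical records.
PREFIX = b"foobar" * 20

# Prime a compressor with the prefix once. Z_SYNC_FLUSH forces all pending
# output out and leaves the stream on a byte boundary.
_base_compressor = zlib.compressobj()
_prefix_bytes = _base_compressor.compress(PREFIX)
_prefix_bytes += _base_compressor.flush(zlib.Z_SYNC_FLUSH)

# Prime a decompressor with the same bytes, also once.
_base_decompressor = zlib.decompressobj()
_base_decompressor.decompress(_prefix_bytes)


def compress_record(data: bytes) -> bytes:
    """Compress one record using a clone of the primed compressor."""
    c = _base_compressor.copy()
    return c.compress(data) + c.flush(zlib.Z_SYNC_FLUSH)


def decompress_record(blob: bytes) -> bytes:
    """Decompress one record using a clone of the primed decompressor."""
    d = _base_decompressor.copy()
    return d.decompress(blob)


if __name__ == "__main__":
    blob = compress_record(b"foobar and then some")
    print(len(blob), decompress_record(blob))
```

Because every record starts from a copy of the same primed state, records can be decompressed in any order, and only the incremental bytes need to be stored per record.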
  #17 (permalink)  
05-04-2012, 07:14 AM
Steve Howell

On May 3, 11:59 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> >     compressor = zlib.compressobj()
> >     s = compressor.compress("foobar")
> >     s += compressor.flush(zlib.Z_SYNC_FLUSH)
> >
> >     s_start = s
> >     compressor2 = compressor.copy()
>
> I think you also want to make a decompressor here, and initialize it
> with s and then clone it.  Then you don't have to reinitialize every
> time you want to decompress something.


Makes sense. I believe I got that part correct:

https://github.com/showell/KeyValue/..._compressor.py

> I also seem to remember that the first few bytes of compressed output
> are always some fixed string or checksum, that you can strip out after
> compression and put back before decompression, giving further savings in
> output size when you have millions of records.


I'm pretty sure this happens for free as long as the salt is large
enough, but maybe I'm misunderstanding.
  #18 (permalink)  
05-04-2012, 08:01 AM
Paul Rubin

Steve Howell <showell30@yahoo.com> writes:
> Makes sense. I believe I got that part correct:
>
> https://github.com/showell/KeyValue/..._compressor.py


The API looks nice, but your compress method makes no sense. Why do you
include s.prefix in s and then strip it off? Why do you save the prefix
and salt in the instance, and have self.salt2 and s[len(self.salt):]
in the decompress? You should be able to just get the incremental bit.

> I'm pretty sure this happens for free as long as the salt is large
> enough, but maybe I'm misunderstanding.


No I mean there is some fixed overhead (a few bytes) in the compressor
output, to identify it as such. That's fine when the input and output
are both large, but when there's a huge number of small compressed
strings, it adds up.
  #19 (permalink)  
05-04-2012, 08:09 AM
Steve Howell

On May 4, 1:01 am, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > Makes sense.  I believe I got that part correct:
> >
> > https://github.com/showell/KeyValue/..._compressor.py
>
> The API looks nice, but your compress method makes no sense.  Why do you
> include s.prefix in s and then strip it off?  Why do you save the prefix
> and salt in the instance, and have self.salt2 and s[len(self.salt):]
> in the decompress?  You should be able to just get the incremental bit.


This is fixed now.

https://github.com/showell/KeyValue/..._compressor.py


> > I'm pretty sure this happens for free as long as the salt is large
> > enough, but maybe I'm misunderstanding.
>
> No I mean there is some fixed overhead (a few bytes) in the compressor
> output, to identify it as such.  That's fine when the input and output
> are both large, but when there's a huge number of small compressed
> strings, it adds up.


If it's in the header, wouldn't it be part of the output that comes
before Z_SYNC_FLUSH?
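
One way to check this is to look at the bytes directly. The sketch below assumes Python 3 and the default zlib settings; the variable names are illustrative only. It relies on two facts about the zlib format: the stream begins with a two-byte header, and the Adler-32 checksum is written only when the stream is finished. It also shows that every sync flush ends with the same four bytes (an empty stored block, 00 00 ff ff).

```python
import zlib

# Every Z_SYNC_FLUSH ends with an empty stored block: 00 00 ff ff.
SYNC_MARKER = b"\x00\x00\xff\xff"

c = zlib.compressobj()
prefix_out = c.compress(b"foobar") + c.flush(zlib.Z_SYNC_FLUSH)

# The two-byte zlib header sits at the front of the prefix output.
print("prefix starts with:", prefix_out[:2].hex())

# A record compressed from a clone carries no header of its own,
# but it does end with the constant sync marker.
clone = c.copy()
record_out = clone.compress(b"another record") + clone.flush(zlib.Z_SYNC_FLUSH)
print("record starts with:", record_out[:2].hex())
print("record ends with sync marker:", record_out.endswith(SYNC_MARKER))

# The Adler-32 checksum only appears when the stream is finished.
trailer = c.copy().flush()  # Z_FINISH is the default flush mode
expected = zlib.adler32(b"foobar").to_bytes(4, "big")
print("trailer ends with Adler-32:", trailer[-4:] == expected)
```

So with this approach the header is paid for once, in the prefix, and the checksum is never written for individual records. The constant four-byte marker at the end of each record is the one fixed cost that remains; since it is always the same, it could in principle be stripped before storage and re-appended before decompression, though whether that is the saving being half-remembered here is only a guess.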



  #20 (permalink)  
05-04-2012, 08:58 AM
Paul Rubin

Steve Howell <showell30@yahoo.com> writes:
>> You should be able to just get the incremental bit.

> This is fixed now.


Nice.

> If it's in the header, wouldn't it be part of the output that comes
> before Z_SYNC_FLUSH?


Hmm, maybe you are right. My version was several years ago and I don't
remember it well, but I half-remember spending some time diddling around
with this issue.
  #21 (permalink)  
05-04-2012, 03:27 PM
Steve Howell

On May 3, 6:10 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> > I'm looking for a fairly lightweight key/value store that works for
> > this type of problem:
>
> I'd start with a benchmark and try some of the things that are already in the standard library:
> - bsddb
> - sqlite3 (table of key, value, index key)
> - shelve (though I doubt this one)


Thanks. I think I'm ruling out bsddb, since it's recently deprecated:

http://www.gossamer-threads.com/list.../python/106494

I'll give sqlite3 a spin. Has anybody out there wrapped sqlite3
behind a hash interface already? I know it's simple to do
conceptually, but there are some minor details to work out for large
amounts of data (like creating the index after all the inserts), so if
somebody's already tackled this, it would be useful to see their
code.
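
For anyone wanting a starting point, here is a minimal sketch of what such a wrapper might look like, assuming Python 3 and only the standard sqlite3 module. The class name SqliteStore, the table name kv and the index name kv_key are all made up for illustration. It creates the index only after the bulk insert, as described above, and does not attempt to handle updates or duplicate keys.

```python
import sqlite3


class SqliteStore:
    """Hypothetical read-mostly key/value store on top of sqlite3."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # No index yet: it is created after the bulk load.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT, value BLOB)"
        )

    def bulk_load(self, items):
        """Insert (key, value) pairs in one transaction, then build the index."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO kv (key, value) VALUES (?, ?)", items
            )
            self.conn.execute(
                "CREATE UNIQUE INDEX IF NOT EXISTS kv_key ON kv (key)"
            )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __contains__(self, key):
        row = self.conn.execute(
            "SELECT 1 FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return row is not None

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]


if __name__ == "__main__":
    store = SqliteStore()
    store.bulk_load([("alpha", b"first value"), ("beta", b"second value")])
    print(store["alpha"], "beta" in store, len(store))
```

Whether deferring the index actually pays off for a given data set is something a benchmark would have to confirm.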

> You might find that for a little effort you get enough out of one of these.
>
> Another module which is not in the standard library is hdf5/PyTables and in my experience very fast.


Thanks.
  #22 (permalink)  
05-05-2012, 10:03 AM
Jon Clements

On Friday, 4 May 2012 16:27:54 UTC+1, Steve Howell wrote:
> On May 3, 6:10 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> > > I'm looking for a fairly lightweight key/value store that works for
> > > this type of problem:
> >
> > I'd start with a benchmark and try some of the things that are already in the standard library:
> > - bsddb
> > - sqlite3 (table of key, value, index key)
> > - shelve (though I doubt this one)
>
> Thanks. I think I'm ruling out bsddb, since it's recently deprecated:
>
> http://www.gossamer-threads.com/list.../python/106494
>
> I'll give sqlite3 a spin. Has anybody out there wrapped sqlite3
> behind a hash interface already? I know it's simple to do
> conceptually, but there are some minor details to work out for large
> amounts of data (like creating the index after all the inserts), so if
> somebody's already tackled this, it would be useful to see their
> code.
>
> > You might find that for a little effort you get enough out of one of these.
> >
> > Another module which is not in the standard library is hdf5/PyTables and in my experience very fast.
>
> Thanks.


You could also look at Tokyo Cabinet or Kyoto Cabinet (but I believe the latter has slightly different licensing conditions for commercial use).
  #23 (permalink)  
05-07-2012, 05:21 AM
John Nagle

On 5/4/2012 12:14 AM, Steve Howell wrote:
> On May 3, 11:59 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
>> Steve Howell <showel...@yahoo.com> writes:
>>> compressor = zlib.compressobj()
>>> s = compressor.compress("foobar")
>>> s += compressor.flush(zlib.Z_SYNC_FLUSH)
>>>
>>> s_start = s
>>> compressor2 = compressor.copy()


That's awful. There's no point in compressing six characters
with zlib. Zlib has a minimum overhead of 11 bytes. You just
made the data bigger.

John Nagle
  #24 (permalink)  
05-07-2012, 05:34 AM
Paul Rubin

John Nagle <nagle@animats.com> writes:
> That's awful. There's no point in compressing six characters
> with zlib. Zlib has a minimum overhead of 11 bytes. You just
> made the data bigger.


This hack is about avoiding the initialization overhead--do you really
get 11 bytes after every SYNC_FLUSH? I do remember with that SYNC_FLUSH
trick, that I got good compression with a few hundred bytes of input
(the records I was dealing with at the time).
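
The per-record cost is easy to measure rather than guess at. The following is a rough measuring sketch, assuming Python 3; PREFIX and measure are made-up names, and the prefix text is arbitrary. It compares a standalone zlib.compress call, which pays for the two-byte header and four-byte Adler-32 trailer every time, with the incremental output of a cloned, primed compressor.

```python
import zlib

# Hypothetical prefix; real data would use text resembling typical records.
PREFIX = b"key/value store optimized for disk storage " * 4


def measure(record: bytes, prefix: bytes = PREFIX):
    """Return (raw, standalone, incremental) sizes in bytes for one record."""
    standalone = zlib.compress(record)

    primed = zlib.compressobj()
    primed.compress(prefix)
    primed.flush(zlib.Z_SYNC_FLUSH)  # prefix output is stored once, not per record

    clone = primed.copy()
    incremental = clone.compress(record) + clone.flush(zlib.Z_SYNC_FLUSH)
    return len(record), len(standalone), len(incremental)


if __name__ == "__main__":
    samples = (
        b"foobar",
        b"a fairly lightweight key/value store that works for this problem " * 4,
    )
    for sample in samples:
        print(measure(sample))
```

Running it on records from the real data set, with a prefix built from that same data, would show whether the trick pays off at the sizes in question.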
  #25 (permalink)  
05-07-2012, 04:39 PM
Steve Howell

On May 6, 10:21 pm, John Nagle <na...@animats.com> wrote:
> On 5/4/2012 12:14 AM, Steve Howell wrote:
>
> > On May 3, 11:59 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> >> Steve Howell <showel...@yahoo.com> writes:
> >>>     compressor = zlib.compressobj()
> >>>     s = compressor.compress("foobar")
> >>>     s += compressor.flush(zlib.Z_SYNC_FLUSH)
> >>>
> >>>     s_start = s
> >>>     compressor2 = compressor.copy()
>
> That's awful. There's no point in compressing six characters
> with zlib.  Zlib has a minimum overhead of 11 bytes.  You just
> made the data bigger.


The actual strings that I'm compressing are much longer than six
characters. Obviously, "foobar" was just for example purposes.
