FileDict – bug-fixes and updates

In my previous post I introduced FileDict. I did my best to get it right the first time, but as we all know, this is impossible for any non-trivial piece of code.
I want to thank everyone for their comments and remarks. They've been very helpful.

The Unreliable Pickle

Special thanks go to the mysterious commenter "R", for pointing out that pickling identical objects may produce different strings (!), which makes pickled strings unsuitable for use as keys. My FileDict did indeed suffer from this bug, as this example shows:

>>> key = (1, u'foo')
>>> d[(1, u'foo')] = 4
>>> d[(1, u'foo')]
4
>>> d[key]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "filedict.py", line 64, in __getitem__
    raise KeyError(key)
KeyError: (1, u'foo')

And if that's not bad enough:

>>> d[key] = 5
>>> list(d.items())
[['a', 3], [(1, 2), 3], [(1, u'foo'), 4], [(1, u'foo'), 5]]

Ouch.
I've rewritten the entire storage mechanism to look up rows by the key's hash only, and to compare the keys themselves after unpickling. This may be a bit slower, but I don't (and shouldn't) expect many colliding hashes anyway.
The bug is fixed.
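
For illustration, here is a minimal sketch of the lookup described above. It is not the actual FileDict source: the class name HashedStore, the table name and the column names are all assumptions made for this example, and it is written for Python 3 rather than the Python 2 used in the sessions above.

```python
import pickle
import sqlite3


class HashedStore:
    """Hypothetical stand-in for the idea above, not the real FileDict code."""

    def __init__(self, filename=":memory:"):
        self._conn = sqlite3.connect(filename)
        # Table and column names are assumptions made for this sketch.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS store "
            "(id INTEGER PRIMARY KEY, key_hash INTEGER, key_blob BLOB, value_blob BLOB)"
        )
        self._conn.execute(
            "CREATE INDEX IF NOT EXISTS store_key_hash ON store (key_hash)"
        )

    def _find(self, key):
        # Query by hash only; the pickled key string is never compared.
        # Caveat: Python 3 randomizes str hashes per process by default, so
        # hash() is only a stable on-disk index if PYTHONHASHSEED is fixed.
        rows = self._conn.execute(
            "SELECT id, key_blob, value_blob FROM store WHERE key_hash = ?",
            (hash(key),),
        ).fetchall()
        for row_id, key_blob, value_blob in rows:
            # Colliding hashes are resolved by comparing the unpickled keys.
            if pickle.loads(key_blob) == key:
                return row_id, pickle.loads(value_blob)
        return None

    def __getitem__(self, key):
        found = self._find(key)
        if found is None:
            raise KeyError(key)
        return found[1]

    def __setitem__(self, key, value):
        found = self._find(key)
        value_blob = pickle.dumps(value)
        if found is None:
            self._conn.execute(
                "INSERT INTO store (key_hash, key_blob, value_blob) VALUES (?, ?, ?)",
                (hash(key), pickle.dumps(key), value_blob),
            )
        else:
            # Overwrite the existing row instead of adding a duplicate key.
            self._conn.execute(
                "UPDATE store SET value_blob = ? WHERE id = ?",
                (value_blob, found[0]),
            )
        self._conn.commit()

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM store").fetchone()[0]
```

With this layout, two equal keys always land on the same row no matter how each one happened to pickle, so assigning to d[key] after d[(1, 'foo')] updates the existing entry instead of creating a second one.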

DictMixin

By popular demand, I'm now inheriting from DictMixin. It's made my code a bit shorter, and was not at all painful.
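
To show why a mixin shortens the code, here is a small sketch of the pattern. It is an illustration only, not the FileDict code: TinyDict is a hypothetical class, and because DictMixin exists only in Python 2 (in the UserDict module), the sketch uses its Python 3 counterpart, collections.abc.MutableMapping. Only five methods are written by hand; the rest of the dict interface is supplied by the base class.

```python
from collections.abc import MutableMapping


class TinyDict(MutableMapping):
    """Hypothetical mapping that keeps its pairs in a plain list."""

    def __init__(self):
        self._pairs = []

    def __getitem__(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))

    def __delitem__(self, key):
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                del self._pairs[i]
                return
        raise KeyError(key)

    def __iter__(self):
        # Iterate over a snapshot of the keys, in insertion order.
        return iter([k for k, _ in self._pairs])

    def __len__(self):
        return len(self._pairs)


d = TinyDict()
d['a'] = 3
d.update({(1, 2): 3})        # update() comes from MutableMapping
print('a' in d)              # __contains__ comes from the base class too
print(d.get('missing', 0))   # so does get()
print(list(d.items()))       # and items()
```

Methods such as get, update, pop, setdefault, items and the in operator are all derived from the five methods defined here, which is the same kind of saving that DictMixin gives in Python 2.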

Copy and Close

I no longer close the database on __del__, and instead I rely on the garbage collector. It seems to close the database on time, and it allows one to copy the dictionary (the copies will, of course, always have the same keys, but they don't have to have the same behavior or attributes).

New Source Code

Is available here

6 Comments

  • I have a web scraping and analysis project for which I've been using compressed pickles. As I've gotten more ambitious, the memory usage has increased beyond what I can support, so I thought I'd get to try shelve! Unfortunately, my keys are unicode, so folks recommended sqlite. I've never used that though, and what I really want is a dict. So I'm happy to find filedict. I converted my dicts from pickles to filedicts and it seems to be working! I'll have to figure out file size compression on my own though. In any case, thank you.

  • You may wish to warn people that your filedict does not behave like a dict with respect to values that are mutable types.

    In [3]: td = {}

    In [7]: a = [1,2]

    In [8]: td['a'] = a

    In [9]: td
    Out[9]: {'a': [1, 2]}

    In [10]: a.append(3)

    In [11]: a
    Out[11]: [1, 2, 3]

    In [12]: td
    Out[12]: {'a': [1, 2, 3]}

    In [14]: from filedict import FileDict

    In [15]: tfd = FileDict(filename = 'test.db')

    In [16]: a = [1,2]

    In [17]: tfd['a'] = a

    In [18]: tfd
    Out[18]: {'a': [1, 2]}

    In [19]: a.append(3)

    In [20]: a
    Out[20]: [1, 2, 3]

    In [21]: tfd
    Out[21]: {'a': [1, 2]}

    • erezsh says:

      Hi Joseph,

      I'm glad that you found FileDict useful.

      I'm sorry if its behavior confused you. FileDict stores a copy of the (keys and) values, and not the actual values, so changes to these values don't affect the copy. In this regard, shelve behaves the same way.

      It is always a humbling life lesson that what is obvious to me isn't obvious to others, and vice versa. I'll add a note about this in the original post.

  • Matteo says:

    Hello erezsh,

    I have modified your script a bit (to make it more shelve api compatible). I'd like to publish it, but I'd like to know what kind of license (MIT/BSD/Python?) you are using to do it in the proper way 🙂

    https://gist.github.com/661139

    • erezsh says:

      Hi Matteo,

      I didn't pick a license for this code, and I'm fine with any of the three you mentioned. Please let me know when you've published it 🙂
