FileDict – a Persistent Dictionary in Python

Python's dictionary is possibly the most useful construct in the language.  And I argue that for some purposes, mapping it to a file (in real-time) can be even more useful.

*** Update ***

There's a newer and better version of FileDict, containing bugfixes and corrections, many of which are due to comments on this page.
You can read about it (with explanations) in http://erezsh.wordpress.com/2009/05/31/filedict-bug-fixes-and-updates/

Why?

The dictionary resides in memory, and so has three main "faults":

  1. It only lasts as long as your program does.
  2. It occupies memory that might be useful for other, more commonly accessed, data.
  3. It is limited to how much memory your machine has.

The first can be solved by pickling and unpickling the dictionary, but will not survive an unexpected shutdown (even putting the pickling in a try-finally block won't protect it against all errors).

FileDict

FileDict is a dictionary interface I wrote, that saves and loads its data from a file using keys. Current version uses Sqlite3 to provide consistency, and as a by-product, acidity.

The result is a dictionary which at all-times exists as a file, has virtually no size limit, and can be accessed by several processes concurrently.

It is meant as a quick-and-simple general-purpose solution. It is rarely the best solution, but it is usually good enough.

Performance obviously cannot compare to the builtin dictionary, but it is reasonable and of low complexity (refer to sqlite for more details on that).

Uses

FileDict can be used for many purposes, including:

  • Saving important data in a convinient manner
  • Managing large amounts of data in dictionary form, without the mess of implementing paging or other complex solutions
  • Communication between processes (sqlite supports multiple connections and implements ACID)

Examples

$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> d['bla'] = 10
>>> d[(2,1)] = ['hello', (1,2) ]
-- exit --
$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> print d['bla']
10
>>> print d.items()
[['bla', 10], [(2, 1), ['hello', (1, 2)]]]
>>> print dict(d)
{'bla': 10, (2, 1): ['hello', (1, 2)]}
>>> d=filedict.FileDict(filename="try.dict")
>>> with d.batch:  # using .batch suspend commits, making a batch of changes quicker
>>>    for i in range(100000):
>>>            d[i] = i**2
(takes about 8 seconds on my comp)
>>> print len(d)
100000
>>> del d[103]
>>> print len(d)
99999

Limitations

  • All data (keys and values) must be pickle-able
  • Keys must be hashable (perhaps this should be removed by hashing the pickled key)
  • Keys and values are stored as a copy, so changing them after assignment will not update the dictionary.

Source Code

Is availible in here in here

Future

Additions in the future may include:

  • An LRU-cache for fetching entries
  • A storage strategy different than Sqlite

Other suggestions?

Tags: , , , ,

Categorised in:

20 Comments

  • Why not use shelve, from the stdlib?

  • erezsh says:

    I'll be honest - I forgot about shelve, and did not take it into account.

    However, I can still try and answer your question:
    1. Shelve does not implement ACID
    2. Sqlite seems to use more compact storage
    3. There are good tools to examine Sqlite DBs
    4. My implementation doesn't limit the type of your keys (tho this can be mostly fixed with a wrapper)
    5. My implementation allows using the file for inter-process communication

    Thanks for pointing it out, there are certainly situations where shelve would be a better choice.

  • R says:

    Hashing the pickled key won't work because pickle is not consistent. See http://bugs.python.org/issue5518

  • erezsh says:

    Very interesting! I may have to adjust my code to avoid possible bugs. Thanks R.

  • lorg says:

    Re: "Python’s dictionary is possibly the most useful construct in the language."

    Well, it has been said that just as lisp is based on lists, Python is based on dicts 🙂

  • I played around with writing a few UserDicts with similar intent a while ago and ended up building them into a module.

    http://github.com/synack/ncore/blob/dd41752b16b544aa240e8d4f2409d80d674ae643/ncore/data.py

  • Mike says:

    This is very interesting, definitely going to be considering this for a few projects. 🙂

  • ErikW says:

    For these kind of dictionary-emulating classes, I often use the UserDict.DictMixin class. I find that you can get full dictionary emulation with little effort. An alternative implementation using DictMixin can be found here http://python.pastebin.com/f2d208b4

    I've shamelessly stolen the key concept of hash(key)/pickle(key) and indexing on the hash. However I have not optimized the pickling (using binary pickle).

  • Felipe Cruz says:

    Check out copycat.

    Transparent, pure OO persistence tool.

    I'ts a very different concept but has a lot more options than just work with dicts.

  • code43 says:

    Erez, for another project with similar objectives -- with compression -- plus support for regular expression on the key, see

    http://yserial.sourceforge.net

    a timestamp upon insert is also useful. It seems to me that your code currently will deadlock if used in concurrent situations.

    The general idea is definitely worth pursuing: persistance of schema-less data. Thanks very much.

  • gls says:

    Hey, this was just the "ready to go" implementation of sqlite i was searching! I needed a very high level abstraction layer to SQL instruction (I do not understand SQL thing, and I dont want to). Thanks, I am very enthusiast!

  • eru says:

    I am using version 24-May-2009. It works great, except for a bug in has_key:

    import Filedict
    h=Filedict.FileDict(filename='dummy.test')

    h[1]="blabla"
    r=h.get(1, None)
    print r # this is ok
    if h.has_key(1): # this is buggy
    print "ok"

    error-message, line 136:
    c = self.__conn.execute('select count(*) from %(tbl)s where hash=%(key_hash)d and key=%(key_pickle)r;' % locals())

  • erezsh says:

    Hi eru, have you noticed that I posted an updated version with bugfixes?
    You can find it here: http://erezsh.wordpress.com/2009/05/31/filedict-bug-fixes-and-updates/

    Also, has_key is deprecated, and you should replace "h.has_key(1)" with "1 in h".

    Please let me know if you are still getting errors.

  • I looked at filedict, and since I know nothing about SQL, I'm not sure where to plug in bzip2. (Perhaps the pack/unpack?) If you add a compression option, let me know, as I've manually compressed 2.5GB files, but should I want the data again, I will have to manually decompress them.

  • mystilleef says:

    Thank you for FileDict. It's replace Shelve in all my projects. Fantastically simple API too! This is a million times better than Shelve. I hope it finds its way into the standard library.

Leave a Reply

Your email address will not be published. Required fields are marked *