Search code examples
databasepython-3.xindexinghashflat-file

In Raymond Hettingers Pycon 2017 Talk, What is the Database Representation


In Raymond Hettinger's Talk he shows a representation for the index of a database. Shown below it's

[None, 4, None, 1, None, None, 0, None, 2, None, 3, None, None, None,None]

I think he explains it later in the talk but I'm having trouble putting the pieces together

While I kind of get that it contains indices to the table, what is that array and how is it generated? What does it represent? To be really specific what does 4 represent in the 1st position in the list (2nd if were not zero indexing) of the list?

Talk Screenshot Link to the Pycon Video


Solution

  • Time limits necessitated a little brevity and hand-waving in that part of the presentation so that I could get to the central topic, the tech inside Python dictionaries. Here is a bit fuller explanation.

    Flat File

    In relational databases, the data may be stored compactly in a flat file:

    Row Number  Name       Color     City        Fruit
    ----------  --------   --------  ---------   -------
    0           'guido',   'blue',   'austin',   'apple'
    1           'sarah',   'orange', 'dallas',   'banana'
    2           'barry',   'green',  'tuscon',   'orange'
    3           'rachel',  'yellow', 'reno',     'pear'
    4           'tim',     'red',    'portland', 'peach'
    

    Note, there is no wasted space (holes between the records or fields).

    Also, the order of insertion is preserved (new records are appended to the end in order of arrival).

    Alphabetical Index

    To speed lookups, a user may create a separate index into the table. Here is an alphabetical index that can be searched using O(log n) bisection:

    Name      Row Number
    --------  ----------
    'barry'   2
    'guido'   0
    'rachel'  3
    'sarah'   1
    'tim'     4
    

    In Python, that index table would be represented using a list of indices:

    [2, 0, 3, 1, 4]
    

    Note that there is no wasted space (holes between the entries).

    Hashed Index

    Better lookup performance can be obtained by using a hash function to construct an index table. Here is a hashed index table giving O(1) search performance:

    Hash   Name      Row Number
    ----   --------  ----------
    0      -         -
    1      'tim'     4
    2      -         -
    3      'sarah'   1
    4      -         -
    5      -         -
    6      'guido'   0
    7      -         -
    8      'barry'   2
    9      -         -
    10     'rachel'  3
    11     -         -
    12     -         -
    13     -         -
    14     -         -
    15     -         -
    

    In Python, that index table would be represented using a list of indices but with a None value for the unused slots:

    [None, 4, None, 1, None, None, 0, None, 2,
     None, 3, None, None, None, None, None]
    

    These particular values were obtained by running some hash function over each of the names, taking the result modulo 16 (because there are 16 slots in the index table), and performing collision resolution. The particulars are unimportant.

    What is essential is that a hashed index table allows faster lookups than an alphabetical index, but it comes at the cost of introducing a few "holes" and wasting a little space in the index table.

    Conclusion

    Effectively, Python 3.6 dictionaries make the same density/sparseness choices as databases using hashed index tables. Both of them store the core data densely (including both the keys and values). Both store the keys only once. Both use a sparse table of indices for fast O(1) lookup. And both preserve insertion order.

    Caveats

    The analogy is imperfect in a number of ways. For example, database flat files store the data directly in the table while Python containers just have references to the field values. Also, hash tables are typically kept in RAM while databases typically use persistent storage.

    However, the analogy is reasonably accurate about space utilization and about how insertion order can be preserved.