In Raymond Hettinger's Talk he shows a representation for the index of a database. Shown below it's
[None, 4, None, 1, None, None, 0, None, 2, None, 3, None, None, None,None]
I think he explains it later in the talk but I'm having trouble putting the pieces together
While I kind of get that it contains indices to the table, what is that array and how is it generated? What does it represent? To be really specific what does 4 represent in the 1st position in the list (2nd if were not zero indexing) of the list?
Time limits necessitated a little brevity and hand-waving in that part of the presentation so that I could get to the central topic, the tech inside Python dictionaries. Here is a bit fuller explanation.
In relational databases, the data may be stored compactly in a flat file:
Row Number Name Color City Fruit
---------- -------- -------- --------- -------
0 'guido', 'blue', 'austin', 'apple'
1 'sarah', 'orange', 'dallas', 'banana'
2 'barry', 'green', 'tuscon', 'orange'
3 'rachel', 'yellow', 'reno', 'pear'
4 'tim', 'red', 'portland', 'peach'
Note, there is no wasted space (holes between the records or fields).
Also, the order of insertion is preserved (new records are appended to the end in order of arrival).
To speed lookups, a user may create a separate index into the table. Here is an alphabetical index that can be searched using O(log n)
bisection:
Name Row Number
-------- ----------
'barry' 2
'guido' 0
'rachel' 3
'sarah' 1
'tim' 4
In Python, that index table would be represented using a list of indices:
[2, 0, 3, 1, 4]
Note that there is no wasted space (holes between the entries).
Better lookup performance can be obtained by using a hash function to construct an index table. Here is a hashed index table giving O(1)
search performance:
Hash Name Row Number
---- -------- ----------
0 - -
1 'tim' 4
2 - -
3 'sarah' 1
4 - -
5 - -
6 'guido' 0
7 - -
8 'barry' 2
9 - -
10 'rachel' 3
11 - -
12 - -
13 - -
14 - -
15 - -
In Python, that index table would be represented using a list of indices but with a None
value for the unused slots:
[None, 4, None, 1, None, None, 0, None, 2,
None, 3, None, None, None, None, None]
These particular values were obtained by running some hash function over each of the names, taking the result modulo 16 (because there are 16 slots in the index table), and performing collision resolution. The particulars are unimportant.
What is essential is that a hashed index table allows faster lookups than an alphabetical index, but it comes at the cost of introducing a few "holes" and wasting a little space in the index table.
Effectively, Python 3.6 dictionaries make the same density/sparseness choices as databases using hashed index tables. Both of them store the core data densely (including both the keys and values). Both store the keys only once. Both use a sparse table of indices for fast O(1)
lookup. And both preserve insertion order.
The analogy is imperfect in a number of ways. For example, database flat files store the data directly in the table while Python containers just have references to the field values. Also, hash tables are typically kept in RAM while databases typically use persistent storage.
However, the analogy is reasonably accurate about space utilization and about how insertion order can be preserved.