Lately I've been refactoring some bash scripts into Python 3.7 as both a learning exercise and for real use in a project. The resulting implementation uses a very large ordered dictionary, say around 2 to 3 million entries. Storing the data this way has some significant advantages, reducing code complexity and processing time. However, there is one task that eludes me: how to step through the dictionary from a known starting point.
If I were doing this in C, I would make pointer to the desired start point and walk the pointer. If there is any analogous operation in Python, I don't know it and can't find it. All the techniques I found seem to duplicate the some/all of information into a new list, which would be time consuming and waste a lot of memory in my application. It also seems that you can't slice a dictionary, even though they are now ordered by default.
Consider this contrived example dictionary of the Latin alphabet, whose strangely keyed entries are grouped by vowels and consonants, and the entries within each group are sorted alphabetically:
dd = { # key: ( phonetic, letter, ascii, ebcedic, baudot, morse, hollerith, strokes, kind )
4296433290: ( 'Alfa', 'A', 65, 193, 3, '.-', (12,1), 3, 'vowl' ),
5046716526: ( 'Echo', 'E', 69, 197, 1, '.', (12,5), 4, 'vowl' ),
5000200584: ( 'India', 'I', 73, 201, 6, '..', (12,9), 3, 'vowl' ),
5000971262: ( 'Oscar', 'O', 79, 214, 24, '---', (11,6), 1, 'vowl' ),
5000921625: ( 'Uniform', 'U', 85, 228, 7, '..-', (0,4), 1, 'vowl' ),
4297147083: ( 'Yankee', 'Y', 89, 232, 21, '-.--', (0,8), 3, 'vowl' ),
4297256046: ( 'Bravo', 'B', 66, 194, 25, '-...', (12,2), 3, 'cons' ),
4298140290: ( 'Charlie', 'C', 67, 195, 14, '-.-.', (12,3), 1, 'cons' ),
5036185622: ( 'Delta', 'D', 68, 196, 9, '-..', (12,4), 2, 'cons' ),
5036854221: ( 'Foxtrot', 'F', 70, 198, 13, '..-.', (12,6), 3, 'cons' ),
5037458768: ( 'Golf', 'G', 71, 199, 26, '--.', (12,7), 2, 'cons' ),
5035556903: ( 'Hotel', 'H', 72, 200, 20, '....', (12,8), 3, 'cons' ),
5037119814: ( 'Juliett', 'J', 74, 209, 11, '.---', (11,1), 2, 'cons' ),
5035556831: ( 'Kilo', 'K', 75, 210, 15, '-.-', (11,2), 3, 'cons' ),
4296755665: ( 'Lima', 'L', 76, 211, 18, '.-..', (11,3), 2, 'cons' ),
5035557110: ( 'Mike', 'M', 77, 212, 28, '--', (11,4), 4, 'cons' ),
5037118125: ( 'November', 'N', 78, 213, 12, '-.', (11,5), 3, 'cons' ),
5000423356: ( 'Papa', 'P', 80, 215, 22, '.--.', (11,7), 2, 'cons' ),
5000923300: ( 'Quebec', 'Q', 81, 216, 23, '--.-', (11,8), 2, 'cons' ),
5000969482: ( 'Romeo', 'R', 82, 217, 10, '.-.', (11,9), 3, 'cons' ),
5035943840: ( 'Sierra', 'S', 83, 226, 5, '...', (0,2), 1, 'cons' ),
5045251209: ( 'Tango', 'T', 84, 227, 16, '-', (0,3), 2, 'cons' ),
5000168680: ( 'Victor', 'V', 86, 229, 30, '...-', (0,5), 2, 'cons' ),
4296684445: ( 'Whiskey', 'W', 87, 230, 19, '.--', (0,6), 4, 'cons' ),
5000923277: ( 'Xray', 'X', 88, 231, 29, '-..-', (0,7), 2, 'cons' ),
4296215569: ( 'Zulu', 'Z', 90, 233, 17, '--..', (0,9), 3, 'cons' ),
}
And let's say I want to perform some processing on the consonants. And since the processing takes a lot of time (think days), I would like to do it in chunks. In this case, let's say 4 consonants at a time. I know ahead of time the keys for the beginning of a group, for example:
vowlbeg = 4296433290 # key of first vowel
consbeg = 4297256046 # key of first consonant
But I can't figure out how to take advantage of this foreknowledge. For example, to process the 8th through 11th consonant, the best I can do is:
beg = 8 # begin processing with 8th consonant
end = 12 # end processing with 11th consonant
kind = 'cons' # desired group
i=-1
for d in dd.items():
if d[1][-1] is not kind: continue
i += 1
if i < beg: continue
if i >= end: break
print('processing:', i, d)
Which gives the desired results, albeit a bit slowly because I'm walking through the whole dictionary, from the beginning, until I encounter the desired entries.
processing: 8 (5035556831, ('Kilo', 'K', 75, 210, 15, '-.-', (11, 2), 3, 'cons'))
processing: 9 (4296755665, ('Lima', 'L', 76, 211, 18, '.-..', (11, 3), 2, 'cons'))
processing: 10 (5035557110, ('Mike', 'M', 77, 212, 28, '--', (11, 4), 4, 'cons'))
processing: 11 (5037118125, ('November', 'N', 78, 213, 12, '-.', (11, 5), 3, 'cons'))
I think I can express this loop more compactly using list or maybe dictionary comprehensions, but it seems that would create a huge duplicate in memory. Maybe the method above does that, too, I'm not 100% sure.
Things I know about my ordered dictionary
Q: Is there a better way to do this? My backup plan is to just bite the bullet and keep a duplicate set of tuples, one per group, in order to be able to slice it. But that will essentially double my memory, as best I understand it.
Note: It's not evident from this silly example, but being able to access entries by keys in a single dictionary is a HUGE advantage in my application.
For a simple solution using the Python built-ins, you can create a list of keys and then start from any point in the list at the expense of some memory usage for materializing the list. See below for an interactive session demonstrating the point.
It should be easy to do a loop over any range of keys using this technique.
1> data = {id: (id, "a") for id in range(10)}
2> data
{0: (0, 'a'), 1: (1, 'a'), 2: (2, 'a'), 3: (3, 'a'), 4: (4, 'a'), 5: (5, 'a'), 6: (6, 'a'), 7: (7, 'a'), 8: (8, 'a'), 9: (9, 'a')}
3> data.keys()
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
4> data.keys()[5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'dict_keys' object does not support indexing
5> keys = list(data.keys())
6> keys[5]
5
7> data[keys[5]]
(5, 'a')