Search code examples
pythonlistdictionaryasymptotic-complexity

Python - convert list into dictionary in order to reduce complexity


Let's say I have a big list:

word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()] 

//complexity O(n) --> proporcional to list length "n"

I have learned that hash function used for building up dictionaries allows lookup to be much faster, like so:

word_dict = dict((elt, 1) for elt in word_list) 

// complexity O(l) ---> constant.

using word_list, is there a most efficient way which is recommended to reduce the complexity of my code?


Solution

  • The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).

    Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.

    What is different with a dict?

    Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).

    In dict, data is organized in "buckets". When a key:value pair is saved to a dict, hash of the key is calculated and that number is used to identify the bucket into which data is stored. Later on, when the key is looked up, hash(key) is calculated again to identify the bucket and then only that bucket is searched. There is typically only one key:value pair per bucked, so the search can be done in O(1).

    For more detils, see the article about DictionaryKeys on python.org.

    How about a set?

    A set is something like a dictionary with only keys and no values. The question contains this code:

    word_dict = dict((elt, 1) for elt in word_list) 
    

    That is obviously a dictionary which does not need values, so a set would be more appropriate.

    BTW, there is no need to create a word_list which is a list first and convert it to set or dict. The first step can be skipped:

    set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
    

    Are there any drawbacks?

    Always ;)

    • A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.

    • A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.

    • Objects saved to sets have to be hashable, which kind-of implies that they are immutable. If it was possible to modify the object, then its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.

    • Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.

    • Reading everything from a set is also going to be a bit slower than reading everything from a list.

    All of these apply to dict as well as to set.

    Some examples with timings

    Writing to list vs. set:

    >>> timeit.timeit('[n for n in range(1000000)]', number=10)
    0.7802875302271843
    >>> timeit.timeit('{n for n in range(1000000)}', number=10)
    1.025623542189976
    

    Reading from list vs. set:

    >>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
    0.19846207875508526
    >>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
    3.5699193290383846e-06
    

    So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.