
efficiency of long (str) keys in python dictionary


I'm parsing some XML (with some Python 3.4 code) and want to retrieve both the text from a node and its id attribute. Example: <li id="12345"> Some text here </li> My current code is structured around the text only (I'm now adding the id, but didn't need it before). I'm looping through a list of texts/sentences and then doing some processing on each. So I thought of making a dictionary with the text/sentence as key and this id attribute as value.

However, this doesn't feel very efficient. The text can be a whole paragraph, making the key very long, whereas the id is always of fairly limited length (though still of type str, e.g. some alpha characters followed by some digits). Making the ids the keys and the texts the values would require some rewriting of the code. None of this is very problematic, but it got me wondering: how inefficient would it be to have the text (potentially a whole paragraph) as key, compared to an id like "ulp_887362487687678"?

I can just make two reverse dictionaries (one with id as key, the other with text as key) and compare construction time, lookup time, and so on. I've also found some topics on key length limits (Do Dictionaries have a key length limit?). But I'm merely wondering what your thoughts are on this. Is having such long str keys in your dict something you definitely want to avoid, or is it not a big deal? If you could share some pros/cons, that would be great!
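For concreteness, the two-dictionary idea can be sketched with xml.etree.ElementTree (the element and attribute names follow the example above; the variable names and sample data are made up):

```python
import xml.etree.ElementTree as ET

xml_data = '''<ul>
    <li id="ulp_001">Some text here</li>
    <li id="ulp_002">Another sentence.</li>
</ul>'''

root = ET.fromstring(xml_data)

# Build both mappings in one pass over the nodes: one keyed on the
# short id, the other keyed on the (potentially long) text.
text_by_id = {li.get('id'): li.text for li in root.iter('li')}
id_by_text = {li.text: li.get('id') for li in root.iter('li')}

print(text_by_id['ulp_002'])          # Another sentence.
print(id_by_text['Some text here'])   # ulp_001
```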


Solution

  • No, Python string length hardly has an impact on dictionary performance. The only influence the string length could have is on the hash() function used to map the key to a hash table slot.

    String length has very little impact on the performance of hash():

    >>> import random
    >>> from timeit import timeit
    >>> from string import ascii_letters
    >>> generate_text = lambda n: ''.join([random.choice(ascii_letters) for _ in range(n)])
    >>> for i in range(8):
    ...     length = 10 + 10 ** i
    ...     testword = generate_text(length)
    ...     timing = timeit('hash(t)', 'from __main__ import testword as t')
    ...     print(f'Length: {length}, timing: {timing}')
    ... 
    Length: 11, timing: 0.03713229100685567
    Length: 20, timing: 0.024632290937006474
    Length: 110, timing: 0.020530124893411994
    Length: 1010, timing: 0.019442707998678088
    Length: 10010, timing: 0.019319916958920658
    Length: 100010, timing: 0.01950641698203981
    Length: 1000010, timing: 0.019174500019289553
    Length: 10000010, timing: 0.0229318750789389
    

    I stopped at generating a string of 10 million characters, because I couldn't be bothered waiting for my laptop to generate a 100 million character string.

    The timings are pretty much constant, because the value is actually cached on the string object once computed.
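    The caching is easy to observe directly (it is a CPython implementation detail; the script below is illustrative and not from the original answer):

```python
import time

s = 'x' * 10_000_000  # a fresh 10-million-character string

t0 = time.perf_counter()
hash(s)                        # first call walks the whole string
first = time.perf_counter() - t0

t0 = time.perf_counter()
hash(s)                        # second call returns the cached hash
second = time.perf_counter() - t0

# The second call is typically orders of magnitude faster.
print(first, second)
```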

    There is one edge case that would affect performance: strings with extremely long common prefixes. A dictionary lookup consists of two stages: a hash lookup, which selects a bucket in the dictionary, followed by an equality test against any existing keys in that bucket. The equality test can take a lot of time if your keys are not the same object (an identity check with is short-circuits the comparison), are very long, and share most of their prefix. String equality testing uses memcmp, which means the comparison takes O(N) time, with N being the length of the common prefix.

    >>> for i in range(7):
    ...     length = 10 + 10 ** i
    ...     testword = generate_text(length)
    ...     testother = testword[:-10] + generate_text(10)
    ...     timing = timeit('t == o', 'from __main__ import testword as t, testother as o')
    ...     print(f'Length: {length}, timing: {timing}')
    ...
    Length: 11, timing: 0.03756483399774879
    Length: 20, timing: 0.016259542084299028
    Length: 110, timing: 0.023491207975894213
    Length: 1010, timing: 0.03878312499728054
    Length: 10010, timing: 0.28688233299180865
    Length: 100010, timing: 2.5207629159558564
    Length: 1000010, timing: 25.976553458953276
    

    However, this edge case should be rare in practice. Just avoid using extremely long strings that are likely to start with the same text as dictionary keys.
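    If you do end up in that situation, one common workaround (a sketch, not part of the original answer) is to key the dictionary on a fixed-length digest of the text rather than the text itself; digest collisions are theoretically possible but negligible in practice:

```python
import hashlib

def text_key(text):
    # Collapse an arbitrarily long string into a fixed-length
    # 64-character key, so equality tests during lookup stay cheap
    # regardless of how long the texts are or how much they share.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

common_prefix = 'x' * 1_000_000
ids = {text_key(common_prefix + suffix): suffix for suffix in ('a', 'b')}

print(ids[text_key(common_prefix + 'a')])  # a
```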