How do you use Python's string interning?

I was curious about the performance benefits on dictionary lookups when using Python's sys.intern. As a toy example to measure that, I implemented the following code snippets:

Without interning:

import random
from uuid import uuid4

keys = [str(uuid4()) for _ in range(1_000_000)]
values = [random.random() for _ in range(1_000_000)]
my_dict = dict(zip(keys, values))
keys_sample = random.choices(keys, k=50_000)


def get_values(d, ks):
    return [d[k] for k in ks]

Test results using ipython:

%timeit get_values(my_dict, keys_sample)
8.92 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With interning:

import sys
import random
from uuid import uuid4

keys = [sys.intern(str(uuid4())) for _ in range(1_000_000)]
values = [random.random() for _ in range(1_000_000)]
my_dict = dict(zip(keys, values))
keys_sample = random.choices(keys, k=50_000)


def get_values(d, ks):
    return [d[k] for k in ks]

Test results:

%timeit get_values(my_dict, keys_sample)
8.83 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, no meaningful difference between both cases. I tried pumping up the dict size and sampling, but the results stayed on par. Am I using sys.intern incorrectly? Or is the testing flawed?

Solution

I assume you're referring to this section of the Python manual:

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. [...]

(Source.)

In this benchmark, you are running into two optimizations of Python's string handling which make it impossible to gain any performance from intern().

When you do random.choice() on the list of strings, the sampled list is a list of references to the same strings. When Python compares two references to the same string, no string comparison is done - two references pointing to the same string must be equal. In other words, if a is b, then a == b.
When you compare two strings for equality, and a is not b, Python does one other check to avoid a string compare. It checks the hash of a and b, and if hash(a) != hash(b), then the strings cannot be equal. Also, hashes are cached within the string object to avoid repeatedly computing them.

So there are two cases during this dictionary lookup. Either it is comparing two of the same strings, in which case the reference equality shortcut prevents a string comparison, or it is comparing two different strings, in which case the hash comparison shortcut has a high probability of preventing a string comparison.

This gives us a path to create a (somewhat contrived) benchmark where sys.intern() does help. Imagine you have a server which receives strings from a socket and compares them to a hard-coded dictionary. In this case, the strings from the socket are not going to have reference equality with the strings from the dictionary.

To simulate this case, I have created a function which copies each of the strings in keys_sample, so that Python cannot apply optimization #1. It also prefixes each string with a thousand a's, to make string comparisons as expensive as possible.

import sys
import random
from uuid import uuid4


N = 100_000


def copy_string_list(l):
    """Returns copies of each string in l.

    Warning: Never do this in real code, it is slower and has no benefit."""
    return [(a + '.')[:-1] for a  in l]


def get_values(d, ks):
    return [d[k] for k in ks]


def generate_keys(intern, copy_strings):
    keys = ['a' * 1000 + str(uuid4()) for _ in range(N)]
    if intern:
        keys = [sys.intern(key) for key in keys]
    values = [random.random() for _ in range(N)]
    my_dict = dict(zip(keys, values))
    keys_sample = random.choices(keys, k=N // 2)
    if copy_strings:
        keys_sample = copy_string_list(keys_sample)
    if intern:
        keys_sample = [sys.intern(key) for key in keys_sample]
    return my_dict, keys_sample


print("Looking up interned strings")
my_dict, keys_sample = generate_keys(intern=True, copy_strings=True)
%timeit get_values(my_dict, keys_sample)
print("Looking up non-interned strings")
my_dict, keys_sample = generate_keys(intern=False, copy_strings=True)
%timeit get_values(my_dict, keys_sample)
print("Neither interning nor copying")
my_dict, keys_sample = generate_keys(intern=False, copy_strings=False)
%timeit get_values(my_dict, keys_sample)

Results:

Looking up interned strings
10.7 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Looking up non-interned strings
19.9 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Neither interning nor copying
10.5 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)