Search code examples
pythonstringcachingpython-internals

Python - Why not all immutable objects are always cached?


I am not sure what is happening under the hood with regards to the Python object model for the code below.

You can download the data for the ctabus.csv file from this link

import csv

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)

        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                    'route': route,
                    'date': date,
                    'daytype': daytype,
                    'rides': rides})

    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))

print(len(route_ids)) #690072

# unique_routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])

print(len(unique_routes)) #185

When I call print(len(route_ids)) it prints "690072". Why did Python end up creating these many objects?

I expect this count to be either 185 or 736461. 185 because, when I count the unique routes in set the length of that set comes out to be 185. 736461 because, this is the total number of records in csv file.

What is this weird number "690072"?

I am trying to understand why this partial caching? Why python can't perform a full caching something like below.

import csv

route_cache = {}

#some hack to cache
def cached_route(routename):
    if routename not in route_cache:
        route_cache[routename] = routename
    return route_cache[routename]

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)

        for row in rows:
            row[0] = cached_route(row[0]) #cache trick
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                    'route': route,
                    'date': date,
                    'daytype': daytype,
                    'rides': rides})

    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461

# unique_routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])

print(len(unique_routes)) #185

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))

print(len(route_ids)) #185

Solution

  • A typical record from the file looks like following:

    rows[0]
    {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
    

    That means most of your immutable objects are strings and only the 'rides'-value is an integer.

    For small integers (-5...255), Python3 keeps an integer pool - so these small integers feels like being cached (as long as PyLong_FromLong and Co. are used).

    The rules are more complicated for strings - they are, as pointed out by @timgeb, interned. There is a greate article about interning, even if it is about Python2.7 - but not much changed since then. In a nutshell, the most important rules are:

    1. all strings of length 0 and 1 are interned.
    2. stings with more than one character are interned if they constist of characters that can be used in identifiers and are created at compile time either directly or through peephole optimization/constant folding (but in the second case only if the result is no longer than 20 characters (4096 since Python 3.7).

    All of the above are implementation details, but taking them into account we get the following for the row[0] above:

    1. 'route', 'date', 'daytype', 'rides' are all interned because they created at compile time of the function read_as_dicts and don't have "strange" characters.
    2. '3' and 'W' are interned because their length is only 1.
    3. 01/01/2001 isn't interned because it is longer than 1, created at runtime and whouldn't qualify anyway because it has character / in it.
    4. 7354 isn't from the small integer pool, because too large. But other entries might be from this pool.

    This was an explanation for the current behavior, with only some objects being "cached".

    But why doesn't Python cache all created strings/integer?

    Let's start with integers. In order to be able to look-up fast if an integer-number is already created (much faster than O(n)), one has to keep an additional look-up data-structure, which needs additional memory. However, there are so many integers, that the probability to hit one already existing integer again is not very high, so the memory overhead for the look-up-data-structure will not be repaid in the most cases.

    Because strings need more memory, the relative (memory) cost of the look-up data-structure isn't that high. But it doesn't make any sense to intern a 1000-character-string, because the probability for a randomly created string to have the very same characters is almost 0!

    On the other hand, if for example a hash-dictionary is used as the look-up structure, the calculation of the hash will take O(n) (n-number of characters), which probably won't pay off for large strings.

    Thus, Python makes a trade off, which works pretty well in most scenarios - but it cannot be perfect in some special cases. Yet for those special scenarios you can optimize per hand using sys.intern().


    Note: Having the same id doesn't mean to be the same object, if the live time of two objects don't overlapp, - so your reasoning in the question isn't entrirely watherproof - but this is of no consequence in this special case.