In my code to create N random strings of length k, a generator expression that builds a random k-character string gets called either N+1 or N*(k+1) (i.e., N*k + N) times. This appears to happen with the exact same code, and I can switch back and forth seemingly non-deterministically by redefining the `chrs()` helper function. I suspect the profiler, but I have no clue as to the mechanism.
System details, in case they matter: CPython, Python 2.7.17 and Python 3.6.9, on Ubuntu 18.04 on an HP Envy 15k (i7-4720).
This is not a proper MCVE, as I haven't been able to reproduce the issue deterministically; instead, running the code below will hopefully produce one of the two profiler outputs shown. If you run it in the REPL and repeatedly redefine the `chrs()` function using the same code, you may be able to get both outputs.
I wrote this `bits_strings()` function when benchmarking various ways of generating N (`num_strings`) random strings of length k (`len_string`):
```python
import cProfile
import random
import string

def chrs(bits, len_string):
    return ''.join(chr((bits >> (8*x)) & 0xff) for x in range(len_string))

def bits_strings(num_strings, len_string):
    return list(
        chrs(random.getrandbits(len_string*8), len_string)
        for x in range(num_strings)
    )

cProfile.run('bits_strings(1000, 2000)')
```
To benchmark it, I used `cProfile.run()`.
When I ran the code the first time, it seemed decently fast:
```
cProfile.run('bits_strings(1000, 2000)')
         2005005 function calls in 0.368 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.970    1.970 <stdin>:1(bits_strings)
     1000    0.001    0.000    1.963    0.002 <stdin>:1(chrs)
     1001    0.001    0.000    1.970    0.002 <stdin>:2(<genexpr>)
        1    0.000    0.000    1.970    1.970 <string>:1(<module>)
  2000000    0.173    0.000    0.173    0.000 {chr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1000    0.005    0.000    0.005    0.000 {method 'getrandbits' of '_random.Random' objects}
     1000    0.178    0.000    1.953    0.002 {method 'join' of 'str' objects}
     1001    0.009    0.000    0.009    0.000 {range}
```
but later in the day:
```
cProfile.run('bits_strings(1000, 2000)')
         4005004 function calls in 1.960 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.961    1.961 <stdin>:1(bits_strings)
     1000    0.001    0.000    1.954    0.002 <stdin>:1(chrs)
  2001000    1.593    0.000    1.762    0.000 <stdin>:2(<genexpr>)
        1    0.000    0.000    1.961    1.961 <string>:1(<module>)
  2000000    0.170    0.000    0.170    0.000 {chr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1000    0.005    0.000    0.005    0.000 {method 'getrandbits' of '_random.Random' objects}
     1000    0.182    0.000    1.944    0.002 {method 'join' of 'str' objects}
     1001    0.009    0.000    0.009    0.000 {range}
```
The difference looks like it comes from the `<genexpr>`, which, from the `cumtime` and its location across various similar tests, I've inferred to be the one in `chrs()`. (My instinct is that `(chr((bits >> (8*x)) & 0xff) for x in range(len_string))` should be getting called N times, while `chr((bits >> (8*x)) & 0xff)` should be getting called N*k times.)
So what's the difference between the two `bits_strings()` functions? Comparing the code between the two attempts, I found none. In fact, I had a Python REPL session where I defined the functions, ran the profiler twice (the second time to check whether this was a caching issue), and then redefined the functions by using the up-arrow to retrieve my previously entered code. It exhibited the same divergence: https://pastebin.com/1u1j1ZUt
I suspect cProfile. I switched between the "fast" and "slow" versions by redefining `chrs()`, but this time had `cProfile.run()` itself get profiled by an outer `cProfile.run()`. Both in observed time and per the outer `cProfile.run()`'s reporting, it looks like the inner `cProfile.run()` (the one that profiles `bits_strings()`) has occasionally been underreporting how long `bits_strings()` takes: https://pastebin.com/C4W4FEjJ
I also used `time.time()` before and after `cProfile.run('bits_strings(1000, 1000)')` calls and noticed a discrepancy between the "fast" profile's `cumtime` and the actual wall-clock time taken when running the method. But I'm still unclear on the mechanism.
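Roughly, the wall-clock cross-check looked like this (a sketch, using `cProfile.runctx()` instead of `cProfile.run()` so that it also works when pasted into a script rather than the REPL):

```python
import cProfile
import random
import time

def chrs(bits, len_string):
    return ''.join(chr((bits >> (8*x)) & 0xff) for x in range(len_string))

def bits_strings(num_strings, len_string):
    return list(chrs(random.getrandbits(len_string*8), len_string)
                for x in range(num_strings))

start = time.time()
# runctx() so bits_strings is looked up in this namespace, not __main__
cProfile.runctx('bits_strings(1000, 1000)', globals(), globals())
wall = time.time() - start
print('wall clock: %.3fs' % wall)
```

Comparing `wall` against the total time in the profiler's header is what exposed the "fast" profile as suspicious.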
Caching: Same code with drastically different performance between runs could be a caching thing. However, the fast version was the first run I ever did with this code (although I'd run other code before in the same REPL session). And I can switch from slow to fast or fast to slow simply by redefining `chrs()` and `bits_strings()`, although whether or not I actually switch seems random. Also worth noting: I've only been able to switch (afaict) by redefining the functions, not by simply rerunning them. So I'm not sure what would be cached, or how it would change the number of `<genexpr>` calls by a factor of k upon redefining the function.
cProfile quirk: I ran the same code without `cProfile.run()` and timed it by calling `time.time()` twice inside `bits_strings`; the total time taken was comparable to the `cProfile.run()` time for the slow case. So `cProfile.run()` is at least not slowing things down. There is a possibility that the fast result is tainted by `cProfile.run()`, but anecdotally, for large values of N and k (`(10000, 10000)`) I noticed a perceptible difference between the first run of `bits_strings()` and the slow one later in the day.
Interpreter quirk: I thought this could be happening because of the interpreter wrongly optimizing the generator expressions inside `chrs()` and `bits_strings()`, but this is the you-pay-for-what-you-write CPython interpreter, so that seems unlikely.
Python 2.7 quirk: Redefining `chrs()` (using the up-arrow to retrieve the prior definition) inconsistently does the same trick in `python3`: https://pastebin.com/Uw7PgF7i
Zalgo coming to exact revenge for that one time I parsed HTML with regex: not ruled out
I'm tempted to believe that `bits_strings()` has, at different times (different definitions of the same function with the same code, not different runs), actually performed differently. But I have no clue why that could be the case. I suspect this is related more closely to `chrs()` than to `bits_strings()`.
I'm unclear on why redefining `chrs()` seems to be unpredictably affecting the profiler output.
This is an artifact of the cProfile results reporting, which cannot distinguish between two code objects with the same name, file name, and line number.
While statistics are being gathered, they are maintained in a randomly-rebalancing binary search tree (apparently a randomized splay tree variant), keyed by code object memory address for Python functions and by `PyMethodDef` memory address for C functions.
However, when profiling is complete, the data is reorganized into a dict, using (file name, line number, code object name) tuples as keys for Python functions. This loses information when those keys are not unique, as in your example. In such cases, the code object with the highest memory address is processed last, so its stats are the ones that appear in the resulting dict.
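A minimal sketch of that collision (the function name `f` and the explicit `compile(..., '<stdin>', ...)` are just stand-ins for what the interactive prompt does when you redefine a function):

```python
import cProfile
import pstats

# Compiling identical source twice gives two distinct code objects that
# nevertheless share co_filename, co_firstlineno and co_name.
src = "def f():\n    return sum(x for x in range(10))\n"
ns1, ns2 = {}, {}
exec(compile(src, '<stdin>', 'exec'), ns1)
exec(compile(src, '<stdin>', 'exec'), ns2)
assert ns1['f'].__code__ is not ns2['f'].__code__

prof = cProfile.Profile()
prof.enable()
ns1['f']()
ns2['f']()
prof.disable()

# Both code objects were profiled, but the final stats dict can hold
# only one entry under the shared ('<stdin>', 1, 'f') key, so one code
# object's stats shadow the other's.
stats = pstats.Stats(prof).stats
f_keys = [key for key in stats if key[2] == 'f']
print(f_keys)  # a single entry despite two distinct code objects
```

The two genexps inside the two `f`s collide the same way, under `('<stdin>', 2, '<genexpr>')`.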
In your tests, you see stats for either the genexp in `chrs` or the genexp in `bits_strings`. Which stats you see depends on which genexp's code object has the higher memory address.
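The call counts in your two outputs are consistent with this masking (with N = 1000 strings of k = 2000 characters):

```python
N, k = 1000, 2000

# Fast output: the <genexpr> row is the genexp in bits_strings(), which
# list() resumes N+1 times (N values plus the final StopIteration).
assert N + 1 == 1001

# Slow output: the <genexpr> row is the genexp in chrs(), which each of
# the N join() calls resumes k+1 times.
assert N * (k + 1) == 2001000

# The two total call counts differ by exactly the swap between those rows.
assert 4005004 - 2005005 == N * (k + 1) - (N + 1)
```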
If you had run your code as a script instead of interactively, the genexps would have been distinguishable by line number, and both would have appeared in the result. Interactive mode restarts line numbering for each "unit" of code entered interactively (one `>>>` prompt and all following `...` prompts). However, there are still cases where this profiling artifact can occur non-interactively, such as two genexps on the same line, or two files with the same name.
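For instance, two genexps written on the same line of a script remain indistinguishable, because all three key components match (the `key()` helper below just mirrors the tuple described above; it is not a public API):

```python
# Two generator expressions defined on the same line of the same file.
g1, g2 = (x for x in range(3)), (y * y for y in range(3))

def key(code):
    # mirrors the (file name, line number, name) stats-dict key
    return (code.co_filename, code.co_firstlineno, code.co_name)

assert g1.gi_code is not g2.gi_code         # distinct code objects...
assert key(g1.gi_code) == key(g2.gi_code)   # ...but identical stats keys
print(key(g1.gi_code))
```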