The Problem Description:
I have this custom "checksum" function:
NORMALIZER = 0x10000

def get_checksum(part1, part2, salt="trailing"):
    """Returns a checksum of two strings."""
    combined_string = part1 + part2 + " " + salt if part2 != "***" else part1
    ords = [ord(x) for x in combined_string]
    checksum = ords[0]  # initial value
    # TODO: document the logic behind the checksum calculations
    iterator = zip(ords[1:], ords)
    checksum += sum(x + 2 * y if counter % 2 else x * y
                    for counter, (x, y) in enumerate(iterator))
    checksum %= NORMALIZER
    return checksum
I want to benchmark this function on both Python 3.6 and PyPy to see whether it runs faster on PyPy, but I'm not sure what the most reliable and clean way to do that is.
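Before comparing timings it may be worth confirming that both interpreters compute the same value. A minimal sanity check, assuming the function is saved in test.py as in the commands below (the script name here is arbitrary):

# sanity_check.py -- hypothetical helper script
from test import get_checksum

# Same shape of input as the benchmark, just smaller.
print(get_checksum('test1' * 1000, 'test2' * 1000))

# Run under both interpreters and compare the printed values:
#   $ python3.6 sanity_check.py
#   $ pypy sanity_check.py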
What I've tried and the Question:
Currently, I'm using timeit for both:
$ python3.6 -mtimeit -s "from test import get_checksum" "get_checksum('test1' * 100000, 'test2' * 100000)"
10 loops, best of 3: 329 msec per loop
$ pypy -mtimeit -s "from test import get_checksum" "get_checksum('test1' * 100000, 'test2' * 100000)"
10 loops, best of 3: 104 msec per loop
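For reference, the same measurement can also be driven from a small script with the standard library's timeit.repeat, which keeps the setup identical under every interpreter; a sketch, again assuming the module is named test.py:

# bench_timeit.py -- hypothetical harness using only the standard library
import timeit

SETUP = "from test import get_checksum"
STMT = "get_checksum('test1' * 100000, 'test2' * 100000)"

# repeat=3, number=10 mirrors the command-line runs above.
times = timeit.repeat(STMT, setup=SETUP, repeat=3, number=10)
print("best of 3: %.0f msec per loop" % (min(times) / 10 * 1000))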
My concern is I'm not absolutely sure if timeit is the right tool for the job on PyPy because of the potential JIT warmup overhead.
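One way to see how much JIT warmup actually matters here is to time individual calls in a loop and watch the per-call times settle; a rough sketch, again assuming test.py:

# warmup_check.py -- hypothetical script to visualize warmup behaviour
import time
from test import get_checksum

a = 'test1' * 100000
b = 'test2' * 100000

for i in range(10):
    start = time.perf_counter()
    get_checksum(a, b)
    elapsed = time.perf_counter() - start
    # On PyPy the first few iterations are typically slower while the JIT
    # compiles the hot code; later iterations should settle to a steady value.
    print(i, "%.3f s" % elapsed)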
Plus, PyPy itself reports the following before the test results:
WARNING: timeit is a very unreliable tool. use perf or something else for real measurements
pypy -m pip install perf
pypy -m perf timeit -s 'from test import get_checksum' "get_checksum('test1' * 1000000, 'test2' * 1000000)"
What would be the best and most accurate approach to benchmarking the exact same function across these and potentially other Python implementations?
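One option that treats every interpreter the same way is to follow that warning and drive the benchmark through the perf package (published nowadays as pyperf), which spawns worker processes and performs warmup runs that are excluded from the reported values, so PyPy's JIT warmup is accounted for. A sketch, assuming the function lives in test.py and that the package is installed for each interpreter:

# bench_perf.py -- hypothetical script; on older installs the package and
# module are named "perf" rather than "pyperf"
import pyperf
from test import get_checksum

runner = pyperf.Runner()
# bench_func repeatedly calls get_checksum(*args), handling calibration,
# warmup runs and multiple worker processes.
runner.bench_func('get_checksum',
                  get_checksum, 'test1' * 100000, 'test2' * 100000)

Running python3.6 bench_perf.py and pypy bench_perf.py then produces directly comparable mean and standard deviation figures.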
You could increase the number of repetitions with the --repeat parameter in order to improve timing accuracy. See: