Search code examples
pythonperformancecomparison

Why is it faster to compare strings that match than strings that do not?


Here are two measurements:

timeit.timeit('"toto"=="1234"', number=100000000)
1.8320042459999968
timeit.timeit('"toto"=="toto"', number=100000000)
1.4517491540000265

As you can see, comparing two strings that match is faster than comparing two strings with the same size that do not match. This is quite disturbing: During a string comparison, I believed that Python was testing strings character by character, so "toto"=="toto" should be longer to test than "toto"=="1234" as it requires four tests against one for the non-matching comparison. Maybe the comparison is hash-based, but in this case, timings should be the same for both comparisons.

Why?


Solution

  • Combining my comment and the comment by @khelwood:

    TL;DR:
    When analysing the bytecode for the two comparisons, it reveals the 'time' and 'time' strings are assigned to the same object. Therefore, an up-front identity check (at C-level) is the reason for the increased comparison speed.

    The reason for the same object assignment is that, as an implementation detail, CPython interns strings which contain only 'name characters' (i.e. alpha and underscore characters). This enables the object's identity check.


    Bytecode:

    import dis
    
    In [24]: dis.dis("'time'=='time'")
      1           0 LOAD_CONST               0 ('time')  # <-- same object (0)
                  2 LOAD_CONST               0 ('time')  # <-- same object (0)
                  4 COMPARE_OP               2 (==)
                  6 RETURN_VALUE
    
    In [25]: dis.dis("'time'=='1234'")
      1           0 LOAD_CONST               0 ('time')  # <-- different object (0)
                  2 LOAD_CONST               1 ('1234')  # <-- different object (1)
                  4 COMPARE_OP               2 (==)
                  6 RETURN_VALUE
    

    Assignment Timing:

    The 'speed-up' can also be seen in using assignment for the time tests. The assignment (and compare) of two variables to the same string, is faster than the assignment (and compare) of two variables to different strings. Further supporting the hypothesis the underlying logic is performing an object comparison. This is confirmed in the next section.

    In [26]: timeit.timeit("x='time'; y='time'; x==y", number=1000000)
    Out[26]: 0.0745926329982467
    
    In [27]: timeit.timeit("x='time'; y='1234'; x==y", number=1000000)
    Out[27]: 0.10328884399496019
    

    Python source code:

    As helpfully provided by @mkrieger1 and @Masklinn in their comments, the source code for unicodeobject.c performs a pointer comparison first and if True, returns immediately.

    int
    _PyUnicode_Equal(PyObject *str1, PyObject *str2)
    {
        assert(PyUnicode_CheckExact(str1));
        assert(PyUnicode_CheckExact(str2));
        if (str1 == str2) {                  // <-- Here
            return 1;
        }
        if (PyUnicode_READY(str1) || PyUnicode_READY(str2)) {
            return -1;
        }
        return unicode_compare_eq(str1, str2);
    }
    

    Appendix:

    • Reference answer nicely illustrating how to read the disassembled bytecode output. Courtesy of @Delgan
    • Reference answer which nicely describes CPython's string interning. Coutresy of @ShadowRanger