Performance variance among string-compare methods

As a student, I have a course about using new features in Python3.8+ and it also includes Python3.10. So I'm now trying to use match-case in my homework to kill a long series of if-else which should be replace by switch-case in cpp.

In my homework I need a function to convert string "male/female/..." into integer. It's quite easy but I was suddenly interested in the performace of different string-compare methods so I designed the testing code below to test the performance among three different methods.

import timeit
from typing import Callable


def test_match_case(sex: str):
    match sex:
        case 'male':
            gender = 1
        case 'female':
            gender = 2
        case '________female':
            gender = 3
        case '#________________female':
            gender = 4
        case _:
            gender = 0

    return gender


def test_if_full_compare(sex: str):
    if sex == 'male':
        gender = 1
    elif sex == 'female':
        gender = 2
    elif sex == '________female':
        gender = 3
    elif sex == '#________________female':
        gender = 4
    else:
        gender = 0

    return gender


def test_if_startswith(sex: str):
    if sex.startswith('m'):
        gender = 1
    elif sex.startswith('f'):
        gender = 2
    elif sex.startswith('_'):
        gender = 3
    elif sex.startswith('#'):
        gender = 4
    else:
        gender = 0

    return gender


def _test_and_print(func: Callable, arg: str, times: int = 10000000):
    func_name = func.__name__
    time_cost = timeit.timeit(f"""{func_name}("{arg}")""", f"""from __main__ import {func_name}""", number=times)
    print(f"func: {func_name} arg: {arg} time_cost: {time_cost:.6f}")


_test_and_print(test_match_case, "male")
_test_and_print(test_match_case, "female")
_test_and_print(test_match_case, "________female")
_test_and_print(test_match_case, "#________________female")
_test_and_print(test_if_full_compare, "male")
_test_and_print(test_if_full_compare, "female")
_test_and_print(test_if_full_compare, "________female")
_test_and_print(test_if_full_compare, "#________________female")
_test_and_print(test_if_startswith, "male")
_test_and_print(test_if_startswith, "female")
_test_and_print(test_if_startswith, "________female")
_test_and_print(test_if_startswith, "#________________female")

Here is the testing result on my PC:

func: test_match_case arg: male time_cost: 0.517885
func: test_match_case arg: female time_cost: 0.610869
func: test_match_case arg: ________female time_cost: 0.726382
func: test_match_case arg: #________________female time_cost: 0.846604
func: test_if_full_compare arg: male time_cost: 0.487761
func: test_if_full_compare arg: female time_cost: 0.578670
func: test_if_full_compare arg: ________female time_cost: 0.666413
func: test_if_full_compare arg: #________________female time_cost: 0.827917
func: test_if_startswith arg: male time_cost: 0.917373
func: test_if_startswith arg: female time_cost: 1.362152
func: test_if_startswith arg: ________female time_cost: 1.817514
func: test_if_startswith arg: #________________female time_cost: 2.249916

I have learned a little about "Resident Memory" in Python, but I still have tons of questions about the result above.

match-case is slower than if-else in every case, why?
The time-cost gap between L4 and L8 is much smaller than the other three cases (L1-L5, L2-L6, L3-L7), why? Is this effected by "Resident Memory"?
startswith has such terrible speed with long string. In my gut this build-in function only needs to compare the first character so the time-cost of the last 4 cases should be nearly the same and L12 should be much faster than L8 since it don't need to traverse the entire string. But the truth is the exact opposite of my estimation. How to explain this?
If you find out any other question, please tell me.

I've already searched with words "[python] string compare" but didn't get any expected result.

Solution

There's three reasons for your performance issues, the first two essentially the language/interpreter's "fault", the other one a consequence of how you constructed your test cases:

Python is not a fully compiled, statically typed language, and this means match can't typically optimize much (if I recall correctly, it might even be required to consider cases in order, making it even harder to optimize), and in practice, never performs meaningful optimizations for "jumping" to the right case. The match construct's design actually adds a tiny amount of overhead over if/elif/else chains here, but it's otherwise equivalent to the if/elif/else chain, making it slightly slower in most cases (see details below; there's a reason it catches up on the final case).
Generalized code paths without dedicated bytecode support (e.g. arbitrary method calls) typically have higher fixed overhead than bytecode supported operations (e.g. direct == checks), so if the work done is the same, the bytecode operation wins. They've been improving this over time with LOAD_METHOD/CALL_METHOD opcodes that avoid creating bound method objects, and the Vectorcall protocol to reduce argument handling overhead, but a dedicated bytecode still wins in most cases, so if == and startswith do the same real work, == will win.
Even theoretically, startswith doesn't save anything in this particular use case; every one of your fixed strings is of a different length, so the actual equality test starts off with "do the lengths match?", finds they don't, and immediately says "not equal"; it never even reads a single byte of the string data when they're not equal. And when they are equal, odds are since three of the four strings involved are literals containing legal variable names (which CPython interns as an implementation detail), isn't not even comparing the string's data in the case where they're equal; it checks the "is interned" bit, finds its set in both, and knowing that, can compare the pointer addresses (it might even do this before the size check); two interned strings are equal if and only if they are the same string, so again, no data comparisons required.

Details on point #1/question #2: I checked the results of calling dis.dis on two identical functions, one of which used match/case and one of which used if/elif/else chains. The bytecode produced by CPython 3.10.0 is almost identical, except:

The match/case chain loads the string only once, then before each test save the last one, it uses the DUP_TOP instruction to double the value on the top of the stack (just double the reference, no copies made) so the new copy can be used for the comparison, and the original one sticks around for the next comparison. The DUP_TOP is a super-cheap instruction (roughly equivalent to the LOAD_FAST it uses in the if chain where you have to name the local variable over and over), but...
Since any successful comparison means the duplicate value isn't needed, all paths save the final one need to add a POP_TOP instruction to clear the duplicate that the if/elif chain doesn't need.

This explains your observation in question #2; all the code paths that end with a case that involves a DUP_TOP execute one extra bytecode instruction (the corresponding POP_TOP), while the final case doesn't need it so the instructions executed end up being roughly the same (it executes one LOAD_FAST and case count DUP_TOPs, rather than case count LOAD_FASTs); DUP_TOP being very slightly cheaper than LOAD_FAST (but not nearly enough to make up the cost of a POP_TOP when it's needed), you perform slightly better on the final case of the match that uses the value being matched.