Search code examples
pythoncpythonpython-internals

How are mylist.reverse() and list.reverse(mylist) executed?


Presumably both mylist.reverse() and list.reverse(mylist) end up executing reverse_slice in listobject.c via list_reverse_impl or PyList_Reverse. But how do they actually get there? What are the paths from the Python expressions to the C code in that C file? What connects them? And which of those two reverse functions (if any) do they go through?

Update for the bounty: Dimitris' answer (Update 2: I meant the original version, before it was expanded now) and the comments under it explain parts, but I'm still missing a few things and would like to see a comprehensive answer.

  • How do the two paths from the two Python expressions converge? If I understand things correctly, disassembling and discussing the byte code and what happens to the stack, particularly LOAD_METHOD, would clarify this. (As somewhat done by the comments under Dimitris' answer.)
  • What is the "unbound method" pushed onto the stack? Is it a "C function" (which one?) or a "Python object"?
  • How can I tell that it's the list_reverse function in the listobject.c.h file? I don't think the Python interpreter is like "let's look for a file that sounds similar and for a function that sounds similar". I rather suspect that the list type is defined somewhere and is somehow "registered" under the name "list", and that the reverse function is "registered" under the name "reverse" (maybe that's what the LIST_REVERSE_METHODDEF macro does?).
  • I'm not interested (for this question) about stack frames, argument handling, and similar things (so maybe not much of what goes on inside call_function). Really what interests me here is what I said originally, the path from the Python expressions to that C code in that C file. And preferably how I can find such paths in general.

To explain my motivation: For another question I wanted to know what C code does the work when I call list.reverse(mylist). I was fairly confident I had found it by browsing around and searching for the names. But I want to be more certain and generally understand the connections better.


Solution

  • PyList_Reverse is part of the C-API, you'd call it if you were manipulating Python lists in C, it isn't used in any of the two cases.

    These both go through list_reverse_impl (actually list_reverse which wraps list_reverse_impl) which is the C function that implements both list.reverse and list_instance.reverse.

    Both calls are handled by call_function in ceval, getting there after the CALL_METHOD opcode generated for them is executed (dis.dis the statements to see it). call_function has gone under a good deal of changes in Python 3.8 (with the introduction of PEP 590) so what happens from there on is probably too big of a subject to get into in a single question.

    Additional Questions:

    How do the two paths from the two python expressions converge? If I understand things correctly, disassembling and discussing the byte code and what happens to the stack, particularly LOAD_METHOD, would clarify this.

    Let's start after both expressions compile to their respective bytecode representations:

    l = [1, 2, 3, 4]
    

    Case A, for l.reverse() we have:

    1           0 LOAD_NAME                0 (l)
                2 LOAD_METHOD              1 (reverse)
                4 CALL_METHOD              0
                6 RETURN_VALUE
    

    Case B, for list.reverse(l) we have:

    1           0 LOAD_NAME                0 (list)
                2 LOAD_METHOD              1 (reverse)
                4 LOAD_NAME                2 (l)
                6 CALL_METHOD              1
                8 RETURN_VALUE
    

    We can safely ignore the RETURN_VALUE opcode, doesn't really matter here.

    Let's focus on the individual implementations for each op code, namely, LOAD_NAME, LOAD_METHOD and CALL_METHOD. We can see what gets pushed onto the value stack by viewing what operations are called on it. (Note, it is initialized to point to the value stack located inside the frame object for each expression.)

    LOAD_NAME:

    What is performed in this case is pretty straight-forward. Given our name, l or list in each case, (each name is found in `co->co_names, a tuple that stores the names we use inside the code object) the steps are:

    1. Look for the name inside locals. If found, go to 4.
    2. Look for the name inside globals. If found, go to 4.
    3. Look for the name inside builtins. If found, go to 4.
    4. If found, push value denoted by the name onto the stack. Else, NameError.

    In case A, the name l is found in the globals. In case B, it is found in the builtins. So, after the LOAD_NAME, the stack looks like:

    Case A: stack_pointer -> [1, 2, 3, 4]

    Case B: stack_pointer -> <type list>

    LOAD_METHOD:

    First, I should not that this opcode is generated only if an attribute access is performed (i.e obj.attr). You could also grab a method and call it via a = obj.attr and then a() but that would result in a CALL_FUNCTION opcode generated (see further down for a bit more).

    After loading the name of the callable (reverse in both cases) we search the object on the top of the stack (either [1, 2, 3, 4] or list) for a method named reverse. This is done with _PyObject_GetMethod, its documentation states:

    Return 1 if a method is found, 0 if it's a regular attribute from __dict__ or something returned by using a descriptor protocol.

    A method is only found in Case A, when we access the attribute (reverse) through an instance of the list object. In case B, the callable is returned after the descriptor protocol is invoked, so the return value is 0 (but we do of-course get the object back!).

    Here we diverge on the value returned:

    In Case A:

    SET_TOP(meth);
    PUSH(obj);  // self
    

    We have a SET_TOP followed by a PUSH. We moved the method to the top of the stack and then we push the value again. In this case, stack_pointer now looks:

    stack_pointer -> [1, 2, 3, 4]
                     <reverse method of lists>
    

    In Case B we have:

    SET_TOP(NULL);
    Py_DECREF(obj);
    PUSH(meth);
    

    Again a SET_TOP followed by a PUSH. The reference count of obj (i.e list) is decreased because, as far as I can tell, it isn't really needed anymore. In this case, the stack now looks like so:

    stack_pointer -> <reverse method of lists>
                     NULL
    

    For case B, we have an additional LOAD_NAME. Following the previous steps, the stack for Case B now becomes:

    stack_pointer -> [1, 2, 3, 4]
                     <reverse method of lists>
                     NULL
    

    Pretty similar.

    CALL_METHOD:

    This doesn't make any modifications to the stack. Both cases result in a call to call_function passing the thread state, the stack pointer and the number of positional arguments (oparg).

    The only difference is in the expression used to pass the positional arguments.

    For case A we need to account for the implicit self that should be inserted as the first positional argument. Since the op code generated for it doesn't signal that a positional argument has been passed (because none have been explicitly passed):

    4 CALL_METHOD              0
    

    we call call_function with oparg + 1 = 0 + 1 = 1 to signal that one positional argument exists on the stack ([1, 2, 3, 4]).

    In case B, where we explicitly pass the instance as a first argument, this is accounted for:

    6 CALL_METHOD              1
    

    so the call to call_function can immediately pass oparg as the value for the positional arguments.

    What is the "unbound method" pushed onto the stack? Is it a "C function" (which one?) or a "Python object"?

    It is a Python object that wraps around a C function. The Python object is a method descriptor and the C function it wraps is list_reverse.

    All built-in methods and function are implemented in C. During initialization, CPython initializes all builtins (see list here) and adds wrappers around all the methods. These wrappers (objects) are descriptors that are used to implement Methods and Functions.

    When a method is retrieved from a class via one of its instances, it is said to be bound to that instance. This can be seen by peeking at the __self__ attribute assigned to it:

    m = [1, 2, 3, 4].reverse
    m()                        # use __self__ 
    print(m.__self__)          # [4, 3, 2, 1]
    

    This method can still be called even without the instance that qualifies it. It is bound to that instance. (NOTE: This is handled by the CALL_FUNCTION opcode, not the LOAD/CALL_METHOD ones).

    An unbound method is one that is not yet bound to an instance. list.reverse is unbound, it is waiting to be invoked through an instance to bind to it.

    Something being unbound doesn't mean that it cannot be called, list.reverse is called just fine if you explicitly pass the self argument yourself as an argument. Remember that methods are just special functions that (among other things) implicitly pass self as a first argument after being bound to an instance.

    How can I tell that it's the list_reverse function in the listobject.c.h file?

    This is easy, you can see the list's methods getting initialized in listobject.c. LIST_REVERSE_METHODDEF is simply a macro that, when substituted, adds the list_reverse function to that list. The tp_methods of a list are then wrapped inside function objects as stated previously.

    Things might seem complicated here because CPython uses an internal tool, argument clinic, to automate argument handling. This kinda moves definitions around, obfuscating slightly.