Tags: python, c, rust, python-c-api

How to make CPython report vectorcall as available only when it will actually help performance?


The Vectorcall protocol is a new calling convention for Python's C API defined in PEP 590. The idea is to speed up calls in Python by avoiding the need to build intermediate tuples and dicts, and instead pass all arguments in a C array.

Python lets you check whether a callable supports vectorcall by testing whether PyVectorcall_Function() returns a non-NULL pointer. However, it appears that functions report vectorcall support even when using it actually harms performance.
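In Rust, through PyO3's raw FFI bindings, that check is just a test on the returned function pointer. A minimal sketch (the helper name is mine):

use pyo3::ffi;
use pyo3::prelude::*;

/// True if the callable's type implements the vectorcall protocol,
/// i.e. PyVectorcall_Function() returns a non-NULL function pointer.
fn supports_vectorcall(obj: &Bound<'_, PyAny>) -> bool {
    unsafe { ffi::PyVectorcall_Function(obj.as_ptr()).is_some() }
}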

For example, take the following simple function:

def foo(*args): pass

This function won't benefit from vectorcall: because it collects *args, Python has to pack the arguments into a tuple anyway. So if I allocate a tuple myself instead of a C-style array, it should be faster. I also benchmarked this:

use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

use pyo3::conversion::ToPyObject;
use pyo3::ffi;
use pyo3::prelude::*;

fn criterion_benchmark(c: &mut Criterion) {
    Python::with_gil(|py| {
        let module = PyModule::from_code(
            py,
            cr#"
def foo(*args): pass
        "#,
            c"args_module.py",
            c"args_module",
        )
        .unwrap();
        let foo = module.getattr("foo").unwrap();
        let args_arr = black_box([
            1.to_object(py).into_ptr(),
            "a".to_object(py).into_ptr(),
            true.to_object(py).into_ptr(),
        ]);
        unsafe {
            // Sanity check: even this plain Python function reports vectorcall support
            assert!(ffi::PyVectorcall_Function(foo.as_ptr()).is_some());
        }
        c.bench_function("vectorcall - vectorcall", |b| {
            b.iter(|| unsafe {
                // Allocate the argument array on each iteration (heap-allocated,
                // to mirror the tuple allocation in the other benchmark)
                let args = vec![args_arr[0], args_arr[1], args_arr[2]];
                let result = black_box(ffi::PyObject_Vectorcall(
                    foo.as_ptr(),
                    args.as_ptr(),
                    3,
                    std::ptr::null_mut(),
                ));
                ffi::Py_DECREF(result);
            })
        });
        c.bench_function("vectorcall - regular call", |b| {
            b.iter(|| unsafe {
                let args = ffi::PyTuple_New(3);
                // PyTuple_SET_ITEM steals a reference, so INCREF each argument first
                ffi::Py_INCREF(args_arr[0]);
                ffi::PyTuple_SET_ITEM(args, 0, args_arr[0]);
                ffi::Py_INCREF(args_arr[1]);
                ffi::PyTuple_SET_ITEM(args, 1, args_arr[1]);
                ffi::Py_INCREF(args_arr[2]);
                ffi::PyTuple_SET_ITEM(args, 2, args_arr[2]);
                let result =
                    black_box(ffi::PyObject_Call(foo.as_ptr(), args, std::ptr::null_mut()));
                ffi::Py_DECREF(result);
                ffi::Py_DECREF(args);
            })
        });
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The benchmark is in Rust and uses the convenience functions of the PyO3 framework, but the core work is done through raw FFI calls to the C API, so this shouldn't affect the results.

Results:

vectorcall - vectorcall     time:   [51.008 ns 51.263 ns 51.530 ns]

vectorcall - regular call   time:   [35.638 ns 35.826 ns 36.022 ns]

The benchmark confirms my suspicion: Python has to do additional work when I use the vectorcall API.

On the other hand, the vectorcall API can be more performant than using tuples even when it needs to allocate memory, for example when calling a bound method with the PY_VECTORCALL_ARGUMENTS_OFFSET flag. A benchmark confirms that too.
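For completeness, here is roughly what I mean by that, as a minimal sketch (the helper is made up for illustration, and I'm assuming pyo3's ffi exposes PY_VECTORCALL_ARGUMENTS_OFFSET mirroring the C constant): reserve one spare slot in front of the arguments, pass a pointer to the second slot, and set the offset bit in nargsf so the callee may temporarily use the spare slot.

use pyo3::ffi;

/// Hypothetical helper: vectorcall `callable` with two positional arguments,
/// leaving args[0] free so e.g. a bound method can write its `__self__` there.
unsafe fn vectorcall_with_offset(
    callable: *mut ffi::PyObject,
    a: *mut ffi::PyObject,
    b: *mut ffi::PyObject,
) -> *mut ffi::PyObject {
    // One spare slot at the front, then the real arguments.
    let mut args: [*mut ffi::PyObject; 3] = [std::ptr::null_mut(), a, b];
    ffi::PyObject_Vectorcall(
        callable,
        // Point past the spare slot...
        args.as_mut_ptr().add(1),
        // ...and set the flag telling the callee it may borrow it.
        2 | ffi::PY_VECTORCALL_ARGUMENTS_OFFSET,
        std::ptr::null_mut(),
    )
}

A bound method's vectorcall implementation writes __self__ into that spare slot (and restores it afterwards), so the call needs neither a tuple nor a newly allocated argument array.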

So here is my question: is there a way to know when vectorcall won't help (or will even hurt), or alternatively, when it will help?


Context, even though I don't think it's relevant:

I'm experimenting with a pycall!() macro for PyO3. The macro can call with normal parameters but can also unpack parameters, and it should do so in the most efficient way possible.

Using vectorcall where available sounds like a good idea, but then I hit this obstacle: I cannot know whether I should convert the arguments directly to a tuple or to a C-style array for vectorcall.


Solution

  • If it looks like PyObject_Call is faster for you, that's probably some sort of inefficiency on the Rust side of things, and you should look into optimizing that. Trying to bypass vectorcall doesn't actually provide the Python-side speedup you're thinking of. Particularly, the tuple you're creating is overhead, not an optimization.

    For objects that support vectorcall, including standard function objects, PyObject_Call literally just uses vectorcall:

    if (vector_func != NULL) {
        return _PyVectorcall_Call(tstate, vector_func, callable, args, kwargs);
    }
    

    Even a direct call to tp_call will just indirectly delegate to vectorcall, because for most types that support vectorcall (again including standard function objects), tp_call is set to PyVectorcall_Call.

    So even if your function needs its arguments in a tuple, making a tuple for PyObject_Call doesn't actually save any work. PyVectorcall_Call will extract the tuple's item array:

    /* Fast path for no keywords */
    if (kwargs == NULL || PyDict_GET_SIZE(kwargs) == 0) {
        return func(callable, _PyTuple_ITEMS(tuple), nargs, NULL);
    }
    

    and then if the function needs a tuple, it will have to build a second tuple out of that array.


    It would take a highly unusual custom callable type to actually

    1. support vectorcall,
    2. support tp_call without delegating to vectorcall, and
    3. have tp_call be faster.

    Adding extra code to check for this kind of highly unusual case, even if possible, would lose too much time on the usual cases to pay off.

    And anyway, it's not possible to implement that extra code, in general. There is nothing like a tp_is_vectorcall_faster slot, or any other way for a type to signal that vectorcall is supported but should sometimes be avoided. You'd have to special-case individual types and implement type-specific handling.