Why cython embeded plugins has higher performance in cpython interpreter than rust-c interface versions?

I would like to ask some questions about the underlying principles of python interpreters, because I didn't get much useful information during my own search.

I've been using rust to write python plugins lately, this gives a significant speedup to python's cpu-intensive tasks, and it's also faster to write comparing to c. However it has one disadvantage is that, compared to the old scheme of using cython to accelerate, the call overhead of rust (I'm using pyo3) seems to be greater than that of c(I'm using cython),

For example , we got an empty python function here:

def empty_function():
    return 0

Call it a million times over in Python via a for loop and count the time, so that we can find out each single call takes about 70 nanosecond(in my pc).

And if we compile it to a cython plugin, with the same source code:

# test.pyx
cpdef unsigned int empty_function():
    return 0

The execution time will be reduced to 40 nanoseconds. Which means that we can use cython for some fine-grained embedding, and we can expect it to always execute faster than native python.

However when it comes to Rust, (Honesty speaking, I prefer to use rust for plugin development rather than cython now cause there's no need to do some weird hacking in grammar), the call time will increase to 140 nanoseconds, almost twice as much as native python. Source code as follow:

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;

#[pyfunction]
fn empty_function() -> usize {
    0
}

#[pymodule]
fn testlib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(empty_function, m)?)?;
    Ok(())
}

This means that rust is not suitable for fine-grained embedded replacement of python. If there is a task whose call time is very few and each call takes a long time, then it is perfect to use rust. However if there's a task will be called a lot in the code, then it seems not suitable for rust , cause the overhead of type conversion will take up most of the accelerated time.

I want to know if this can be solved and, more importantly, I want to know the underlying rationale for this discrepancy. Is there some kind of difference with the cpython interpreter when calling between them, like the difference between cpython and pypy when calling c plugins? Where can I get further information? Thanks.

===

Update:

Sorry guys, I didn't anticipate that my question would be ambiguous, after all, the source code for all three has been given, and using timeit to test function runtimes is an almost convention in python development.

My test code is nearly all the same with @Jmb 's code in comment, with some subtle differences that I'm using python setup.py build_ext --inplace way to build instead of bare gcc, but that should not make any difference. Anyway, thanks for supplementary.

Solution

As suggested in the comments, this is a self-answer.

Since the discussion in the comments section did not lead to a clear conclusion, I went to raise an issue in pyo3's repo and get response from whose main maintainer.

In short, the conclusion is that there is no fundamental difference between the plugins compiled by pyo3 or cython when cpython calling them. The current speed difference comes from the different depth of optimization.

Here is the link to the issue: https://github.com/PyO3/pyo3/issues/1470