Function call overhead - why do Python's own builtins appear to be faster than my builtins?

I've been interested in function call overhead, so I wrote a minimal C extension exporting two functions, nop and starnop, that do more or less nothing. They just pass through their input (the two relevant functions are right at the top; the rest is just tedious boilerplate code):

amanmodule.c:

#include <Python.h>

static PyObject* aman_nop(PyObject *self, PyObject *args)
{
  PyObject *obj;

  if (!PyArg_UnpackTuple(args, "arg", 1, 1, &obj))
    return NULL;
  Py_INCREF(obj);
  return obj;
}

static PyObject* aman_starnop(PyObject *self, PyObject *args)
{
  Py_INCREF(args);
  return args;
}

static PyMethodDef AmanMethods[] = {
  {"nop",  (PyCFunction)aman_nop, METH_VARARGS,
   PyDoc_STR("nop(arg) -> arg\n\nReturn arg unchanged.")},
  {"starnop", (PyCFunction)aman_starnop, METH_VARARGS,
   PyDoc_STR("starnop(*args) -> args\n\nReturn tuple of args unchanged")},
  {NULL, NULL}
};

static struct PyModuleDef amanmodule = {
    PyModuleDef_HEAD_INIT,
    "aman",
    "aman - a module about nothing.\n\n"
    "Provides functions 'nop' and 'starnop' which do nothing:\n"
    "nop(arg) -> arg; starnop(*args) -> args\n",
    -1,
    AmanMethods
};

PyMODINIT_FUNC
PyInit_aman(void)
{
    return PyModule_Create(&amanmodule);
}

setup.py:

from setuptools import setup, extension

setup(name='aman', version='1.0',
      ext_modules=[extension.Extension('aman', ['amanmodule.c'])],
      author='n.n.',
      description="""aman - a module about nothing

      Provides functions 'nop' and 'starnop' which do nothing:
      nop(arg) -> arg; starnop(*args) -> args
      """,
      license='public domain',
      keywords='nop pass-through identity')

Next, I time them against pure Python implementations and a couple of builtins that also do next to nothing:

import numpy as np
from aman import nop, starnop
from timeit import timeit

def mnsd(x): return '{:8.6f} \u00b1 {:8.6f} \u00b5s'.format(np.mean(x), np.std(x))

def pnp(x): x

globals={}
for globals['nop'] in (int, bool, (0).__add__, hash, starnop, nop, pnp, lambda x: x):
    print('{:60s}'.format(repr(globals['nop'])),
          mnsd([timeit('nop(1)', globals=globals) for i in range(10)]),
          '  ',
          mnsd([timeit('nop(True)',globals=globals) for i in range(10)]))

First question: am I doing something wrong methodology-wise?

Results for 10 blocks of 1,000,000 calls each (mean ± standard deviation of the time per call; left column is nop(1), right column is nop(True)):

<class 'int'>                                                0.099754 ± 0.003917 µs    0.103933 ± 0.000585 µs
<class 'bool'>                                               0.097711 ± 0.000661 µs    0.094412 ± 0.000612 µs
<method-wrapper '__add__' of int object at 0x8c7000>         0.065146 ± 0.000728 µs    0.064976 ± 0.000605 µs
<built-in function hash>                                     0.039546 ± 0.000671 µs    0.039566 ± 0.000452 µs
<built-in function starnop>                                  0.056490 ± 0.000873 µs    0.056234 ± 0.000181 µs
<built-in function nop>                                      0.060094 ± 0.000799 µs    0.059959 ± 0.000170 µs
<function pnp at 0x7fa31c0512f0>                             0.090452 ± 0.001077 µs    0.098479 ± 0.003314 µs
<function <lambda> at 0x7fa31c051378>                        0.086387 ± 0.000817 µs    0.086536 ± 0.000714 µs

Now my actual question: even though my nops are written in C and do nothing (starnop doesn't even parse its arguments), the builtin hash function is consistently faster. I know that small ints like these are their own hash values in Python, so hash is also a nop here, but it isn't nopper than my nops, so why the speed difference?
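A quick sanity check of that assumption, as a minimal sketch assuming CPython: hash returns an equal value for the two benchmark inputs, although not for every int (-1 is an exception, because the C API reserves -1 as an error indicator).

```python
# Assumes CPython; other implementations may hash ints differently.
inputs = (1, True)
for value in inputs:
    # For the benchmark inputs hash() hands back an equal value,
    # so it has no real work to do.
    assert hash(value) == value

# Not every int hashes to itself: CPython maps hash(-1) to -2.
minus_one_hash = hash(-1)
print(minus_one_hash)
```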

Update: I completely forgot to mention that I'm on a pretty standard x86_64 Linux machine with gcc 4.8.5. I install the extension using python3 setup.py install --user.

Solution

Much (most?) of the overhead in Python function calls is the creation of the args tuple. The argument parsing also adds some overhead.

Functions defined using the METH_VARARGS calling convention require the creation of a tuple to store all the arguments. If you just need a single argument, you can use the METH_O calling convention. With METH_O, no tuple is created; the single argument is passed directly. I've added a nop1 to your example which uses METH_O.
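The role of the args tuple can be made visible from pure Python. This is only an analogy, not a measurement of the C calling conventions themselves: a *args function receives its arguments packed into a tuple, much like starnop does, while a one-parameter function receives the object directly. The names star_identity and single_identity are made up for this sketch.

```python
from timeit import timeit

def star_identity(*args):
    # Like starnop: all positional arguments arrive packed in a tuple.
    return args

def single_identity(arg):
    # Conceptually like a METH_O function: the one argument arrives directly.
    return arg

# The tuple is observable in the return value of the *args version.
assert star_identity(1) == (1,)
assert single_identity(1) == 1

# Rough timings; the absolute numbers depend on machine and Python version.
t_star = timeit('f(1)', globals={'f': star_identity}, number=200_000)
t_single = timeit('f(1)', globals={'f': single_identity}, number=200_000)
print('star_identity   {:8.6f} s'.format(t_star))
print('single_identity {:8.6f} s'.format(t_single))
```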

It's also possible to define functions that take no arguments using METH_NOARGS. See nop2 for the least possible overhead.

When using METH_VARARGS, it is possible to decrease the overhead slightly by reading the args tuple directly instead of calling PyArg_UnpackTuple or the related PyArg_ functions. See nop3.

The builtin hash() function uses the METH_O calling convention.

Modified amanmodule.c

#include <Python.h>

static PyObject* aman_nop(PyObject *self, PyObject *args)
{
  PyObject *obj;

  if (!PyArg_UnpackTuple(args, "arg", 1, 1, &obj))
    return NULL;
  Py_INCREF(obj);
  return obj;
}

static PyObject* aman_nop1(PyObject *self, PyObject *other)
{
  Py_INCREF(other);
  return other;
}

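/* METH_NOARGS functions still receive a second parameter; it is always NULL. */
static PyObject* aman_nop2(PyObject *self, PyObject *ignored)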
{
  Py_RETURN_NONE;
}

static PyObject* aman_nop3(PyObject *self, PyObject *args)
{
  PyObject *obj;

  if (PyTuple_GET_SIZE(args) == 1) {
    obj = PyTuple_GET_ITEM(args, 0);
    Py_INCREF(obj);
    return obj;
  }
  else {
    PyErr_SetString(PyExc_TypeError, "nop3 requires 1 argument");
    return NULL;
  }
}

static PyObject* aman_starnop(PyObject *self, PyObject *args)
{
  Py_INCREF(args);
  return args;
}

static PyMethodDef AmanMethods[] = {
  {"nop",  (PyCFunction)aman_nop, METH_VARARGS,
   PyDoc_STR("nop(arg) -> arg\n\nReturn arg unchanged.")},
  {"nop1",  (PyCFunction)aman_nop1, METH_O,
   PyDoc_STR("nop1(arg) -> arg\n\nReturn arg unchanged.")},
  {"nop2",  (PyCFunction)aman_nop2, METH_NOARGS,
   PyDoc_STR("nop2() -> None\n\nTake no arguments and return None.")},
  {"nop3",  (PyCFunction)aman_nop3, METH_VARARGS,
   PyDoc_STR("nop3(arg) -> arg\n\nReturn arg unchanged.")},
  {"starnop", (PyCFunction)aman_starnop, METH_VARARGS,
   PyDoc_STR("starnop(*args) -> args\n\nReturn tuple of args unchanged")},
  {NULL, NULL}
};

static struct PyModuleDef amanmodule = {
    PyModuleDef_HEAD_INIT,
    "aman",
    "aman - a module about nothing.\n\n"
    "Provides functions 'nop' and 'starnop' which do nothing:\n"
    "nop(arg) -> arg; starnop(*args) -> args\n",
    -1,
    AmanMethods
};

PyMODINIT_FUNC
PyInit_aman(void)
{
    return PyModule_Create(&amanmodule);
}

Modified test.py

import numpy as np
from aman import nop, nop1, nop2, nop3, starnop
from timeit import timeit

def mnsd(x): return '{:8.6f} \u00b1 {:8.6f} \u00b5s'.format(np.mean(x), np.std(x))

def pnp(x): x

globals={}
for globals['nop'] in (int, bool, (0).__add__, hash, starnop, nop, nop1, nop3, pnp, lambda x: x):
    print('{:60s}'.format(repr(globals['nop'])),
          mnsd([timeit('nop(1)', globals=globals) for i in range(10)]),
          '  ',
          mnsd([timeit('nop(True)',globals=globals) for i in range(10)]))

# To test with no arguments
for globals['nop'] in (nop2,):
    print('{:60s}'.format(repr(globals['nop'])),
          mnsd([timeit('nop()', globals=globals) for i in range(10)]),
          '  ',
          mnsd([timeit('nop()',globals=globals) for i in range(10)]))

Results

$ python3 test.py  
<class 'int'>                                                0.080414 ± 0.004360 µs    0.086166 ± 0.003216 µs
<class 'bool'>                                               0.080501 ± 0.008929 µs    0.075601 ± 0.000598 µs
<method-wrapper '__add__' of int object at 0xa6dca0>         0.045652 ± 0.004229 µs    0.044146 ± 0.000114 µs
<built-in function hash>                                     0.035122 ± 0.003317 µs    0.033419 ± 0.000136 µs
<built-in function starnop>                                  0.044056 ± 0.001300 µs    0.044280 ± 0.001629 µs
<built-in function nop>                                      0.047297 ± 0.000777 µs    0.049536 ± 0.007577 µs
<built-in function nop1>                                     0.030402 ± 0.001423 µs    0.031249 ± 0.002352 µs
<built-in function nop3>                                     0.044673 ± 0.004041 µs    0.042936 ± 0.000177 µs
<function pnp at 0x7f946342d840>                             0.071846 ± 0.005377 µs    0.071085 ± 0.003314 µs
<function <lambda> at 0x7f946342d8c8>                        0.066621 ± 0.001499 µs    0.067163 ± 0.002962 µs
<built-in function nop2>                                     0.027736 ± 0.001487 µs    0.027035 ± 0.000397 µs
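
One caveat when reading these numbers: each figure also includes timeit's loop overhead and the global lookup of the name nop, not just the call itself. Below is a minimal sketch for estimating that baseline using only the standard library. The helper best_us_per_call is hypothetical, and it reports the minimum over several rounds (which the timeit documentation suggests as the more meaningful figure) rather than the mean ± standard deviation used above.

```python
from timeit import repeat

def best_us_per_call(stmt, namespace, number=200_000, rounds=5):
    """Fastest observed time per execution of stmt, in microseconds."""
    totals = repeat(stmt, globals=namespace, number=number, repeat=rounds)
    # Each total covers `number` executions; keep the least disturbed round.
    return min(totals) / number * 1e6

namespace = {'nop': hash}
call_us = best_us_per_call('nop(1)', namespace)  # name lookup + call
lookup_us = best_us_per_call('nop', namespace)   # name lookup only, no call
loop_us = best_us_per_call('pass', namespace)    # bare timing loop

print('call        {:8.6f} \u00b5s'.format(call_us))
print('lookup only {:8.6f} \u00b5s'.format(lookup_us))
print('empty loop  {:8.6f} \u00b5s'.format(loop_us))
```

Subtracting the lookup-only figure from the call figure gives a rough estimate of the cost of the call alone, which is the part the calling convention actually affects.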