python, scikit-learn, cython

Accessing sklearn's internal Cython classes and functions


I am interested in testing out many of the internal classes and functions defined within sklearn (e.g. maybe adding a print statement to the tree builder so I can see how the tree got built). However, as many of the internals were written in Cython, I want to learn the best practices and workflows for testing out these functions in a Jupyter notebook.

For example, I managed to import the Stack class from the tree._utils module. I was even able to construct it, but I was unable to call any of its methods. Any thoughts on what I should do in order to call and test the cdef classes and their methods from Python?

%%cython 
from sklearn.tree import _utils
s = _utils.Stack(10)
print(s.top())
# AttributeError: 'sklearn.tree._utils.Stack' object has no attribute 'top'

Solution

  • There are some problems which must be solved before one can use the C interfaces of the internal classes.

    First problem (skip if your sklearn version is >=0.21.x):

    Until version 0.21.x, sklearn used implicit relative imports (as in Python 2); compiling it with Cython's language_level=3 (the default in IPython3) does not work. So for versions < 0.21.x, language_level=2 must be set (i.e. %%cython -2) - or, even better, scikit-learn should be updated.
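
    In the notebook this just means passing the flag to the cell magic, e.g.:

    %%cython -2
    # compiled with language_level=2, so the implicit relative
    # cimports inside the old pxd-files still resolve
    ...
    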

    Second problem:

    We need to include the path to the numpy headers. Let's take a look at a simpler version:

    %%cython 
    from sklearn.tree._tree cimport Node
    print("loaded")
    

    which fails with the uninformative error "command 'gcc' failed with exit status 1" - the real reason can be seen in the terminal, where gcc writes its error message (and not in the notebook):

    fatal error: numpy/arrayobject.h: No such file or directory
    compilation terminated.

    _tree.pxd uses the numpy API, so we need to provide the location of the numpy headers.

    That means we need to add include_dirs=[numpy.get_include()] to the Extension definition. There are two ways to do this with the %%cython magic. The first is via the -I option:

    %%cython -I <path from numpy.get_include()>
    ...
    
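
    The concrete path can be printed beforehand in an ordinary Python cell (numpy.get_include() is part of numpy's public API):

    import numpy
    # print the header directory, to be pasted into the -I option above
    print(numpy.get_include())
    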

    The second, somewhat dirtier trick exploits the fact that the %%cython magic adds the numpy includes automatically whenever it sees the string "numpy" in the cell, so adding a comment like

    %%cython 
    # requires numpy headers
    ... 
    

    is enough.
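
    Outside the notebook, the same fix goes into the Extension definition mentioned above. A minimal setup.py sketch (the module name my_ext and the file my_ext.pyx are hypothetical placeholders):

    from setuptools import setup, Extension
    from Cython.Build import cythonize
    import numpy

    ext = Extension(
        "my_ext",
        sources=["my_ext.pyx"],
        # numpy headers, so that numpy/arrayobject.h is found
        include_dirs=[numpy.get_include()],
    )
    setup(ext_modules=cythonize(ext))
    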

    Last but not least:

    Note: since 0.22 this is no longer an issue, as the pxd-files are included in the installation (see this).

    The pxd-files must be present in the installation for us to be able to cimport them. This is the case for the pxd-files from the sklearn.tree subpackage, as one can see in the local setup.py-file (given this PR, this seems to be a more or less random decision without a strategy behind it):

    ...
    config.add_data_files("_criterion.pxd")
    config.add_data_files("_splitter.pxd")
    config.add_data_files("_tree.pxd")
    config.add_data_files("_utils.pxd")
    ...
    

    but not for some other cython-extensions, in particular not for the sklearn.neighbors subpackage. And that is exactly the problem with your example:

    %%cython 
    # requires numpy headers 
    from sklearn.tree._utils cimport Stack
    s = Stack(10)
    print(s.top())
    

    fails to be cythonized, because _utils.pxd cimports data structures from the neighbors/*.pxd files:

    ...
    from sklearn.neighbors.quad_tree cimport Cell
    ...
    

    which are not present in the installation.
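
    To check which pxd-files actually ship with your installation, you can list them with a small sketch using only the standard library:

    import pathlib, sklearn
    # list every pxd-file present in the installed scikit-learn package
    pkg = pathlib.Path(sklearn.__path__[0])
    for pxd in sorted(pkg.rglob("*.pxd")):
        print(pxd.relative_to(pkg))
    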

    The situation is described in more detail in this SO-post; your options for building are (as described in the link):

    • copy the pxd-files into the installation (see the sketch after this list)
    • reinstall from the downloaded source with pip install -e
    • reinstall from the downloaded source after manipulating the corresponding local setup.py-files.
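
    For the first option, here is a hedged sketch (the source-checkout path is a placeholder to adapt, and the checkout must match the installed version) that copies the missing pxd-files into the installed package:

    import pathlib, shutil, sklearn

    source_dir = pathlib.Path("/path/to/scikit-learn/sklearn")  # source checkout
    install_dir = pathlib.Path(sklearn.__path__[0])

    # copy every pxd-file that is missing from the installation
    for pxd in source_dir.rglob("*.pxd"):
        target = install_dir / pxd.relative_to(source_dir)
        if not target.exists():
            shutil.copy(pxd, target)
            print("copied", target)
    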

    Another option is to ask the developers of sklearn to include the pxd-files in the installation, so that not only building but also distribution becomes possible.