python, scikit-learn, cython

Accessing sklearn's internal Cython classes and functions


I am interested in testing out many of the internal classes and functions defined within sklearn (e.g. maybe adding a print statement to the tree builder so I can see how the tree got built). However, as many of the internals were written in Cython, I want to learn the best practices and workflows for testing out these functions in a Jupyter notebook.

For example, I managed to import the Stack class from the tree._utils module. I was even able to construct it, but I was unable to call any of its methods. Any thoughts on what I should do in order to call and test the cdef classes and their methods from Python?

%%cython 
from sklearn.tree import _utils
s = _utils.Stack(10)
print(s.top())
# AttributeError: 'sklearn.tree._utils.Stack' object has no attribute 'top'

Solution

  • There are some problems which must be solved before one can use the C interfaces of the internal classes.

    First problem (skip if your sklearn version is >=0.21.x):

    Until version 0.21.x, sklearn used implicit relative imports (as in Python 2); compiling it with Cython's language_level=3 (the default in IPython3) does not work. So for versions < 0.21.x, language_level=2 must be set (i.e. %%cython -2) - or, even better, scikit-learn should be updated.
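
    In the notebook this just means passing the flag to the cell magic, e.g.:

    %%cython -2
    # compiled with language_level=2, so the implicit relative
    # cimports inside the old pxd-files still resolve
    ...
    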

    Second problem:

    We need to include the path to the numpy headers. Let's take a look at a simpler version:

    %%cython 
    from sklearn.tree._tree cimport Node
    print("loaded")
    

    which fails with the uninformative error "command 'gcc' failed with exit status 1" - the real reason can be seen in the terminal, where gcc writes its error message (and not in the notebook):

    fatal error: numpy/arrayobject.h: No such file or directory
    compilation terminated.

    _tree.pxd uses the numpy API, so we need to provide the location of the numpy headers.

    That means we need to add include_dirs=[numpy.get_include()] to the Extension definition. There are two ways to do this with the %%cython magic. The first is via the -I option:

    %%cython -I <path from numpy.get_include()>
    ...
    
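
    The concrete path can be printed beforehand in an ordinary Python cell (numpy.get_include() is part of numpy's public API):

    import numpy
    # print the header directory, to be pasted into the -I option above
    print(numpy.get_include())
    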

    The second, somewhat dirtier trick exploits the fact that the %%cython magic adds the numpy includes automatically whenever it sees the string "numpy" in the cell, so adding a comment like

    %%cython 
    # requires numpy headers
    ... 
    

    is enough.
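
    Outside the notebook, the same fix goes into the Extension definition mentioned above. A minimal setup.py sketch (the module name my_ext and the file my_ext.pyx are hypothetical placeholders):

    from setuptools import setup, Extension
    from Cython.Build import cythonize
    import numpy

    ext = Extension(
        "my_ext",
        sources=["my_ext.pyx"],
        # numpy headers, so that numpy/arrayobject.h is found
        include_dirs=[numpy.get_include()],
    )
    setup(ext_modules=cythonize(ext))
    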

    Last but not least:

    Note: since 0.22 this is no longer an issue, as the pxd-files are included in the installation (see this).

    The pxd-files must be present in the installation for us to be able to cimport them. This is the case for the pxd-files from the sklearn.tree subpackage, as one can see in the local setup.py-file (given this PR, this seems to be a more or less random decision without a strategy behind it):

    ...
    config.add_data_files("_criterion.pxd")
    config.add_data_files("_splitter.pxd")
    config.add_data_files("_tree.pxd")
    config.add_data_files("_utils.pxd")
    ...
    

    but not for some other cython-extensions, in particular not for the sklearn.neighbors subpackage. And that is exactly the problem with your example:

    %%cython 
    # requires numpy headers 
    from sklearn.tree._utils cimport Stack
    s = Stack(10)
    print(s.top())
    

    fails to be cythonized, because _utils.pxd cimports data structures from the neighbors/*.pxd files:

    ...
    from sklearn.neighbors.quad_tree cimport Cell
    ...
    

    which are not present in the installation.
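
    To check which pxd-files actually ship with your installation, you can list them with a small sketch using only the standard library:

    import pathlib, sklearn
    # list every pxd-file present in the installed scikit-learn package
    pkg = pathlib.Path(sklearn.__path__[0])
    for pxd in sorted(pkg.rglob("*.pxd")):
        print(pxd.relative_to(pkg))
    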

    The situation is described in more detail in this SO-post; your options for building are (as described in the link):

    • copy the pxd-files into the installation (see the sketch after this list)
    • reinstall from the downloaded source with pip install -e
    • reinstall from the downloaded source after manipulating the corresponding local setup.py-files.
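
    For the first option, here is a hedged sketch (the source-checkout path is a placeholder to adapt, and the checkout must match the installed version) that copies the missing pxd-files into the installed package:

    import pathlib, shutil, sklearn

    source_dir = pathlib.Path("/path/to/scikit-learn/sklearn")  # source checkout
    install_dir = pathlib.Path(sklearn.__path__[0])

    # copy every pxd-file that is missing from the installation
    for pxd in source_dir.rglob("*.pxd"):
        target = install_dir / pxd.relative_to(source_dir)
        if not target.exists():
            shutil.copy(pxd, target)
            print("copied", target)
    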

    Another option is to ask the developers of sklearn to include the pxd-files in the installation, so that not only building but also distribution becomes possible.