python, package, python-module

Python `pkgutil.get_data` disrupts future imports


Consider the following package structure:

.
├── module
│   ├── __init__.py
│   └── submodule
│       ├── attribute.py
│       ├── data.txt
│       └── __init__.py
└── test.py

and the following piece of code:

import pkgutil
data = pkgutil.get_data('module.submodule', 'data.txt')
import module.submodule.attribute
retval = module.submodule.attribute.hello()

Running this will raise the error:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    retval = module.submodule.attribute.hello()
AttributeError: module 'module' has no attribute 'submodule'

However, if you run the following:

import pkgutil
import module.submodule.attribute
data = pkgutil.get_data('module.submodule', 'data.txt')
retval = module.submodule.attribute.hello()

or

import pkgutil
import module.submodule.attribute
retval = module.submodule.attribute.hello()

it works fine.

Why does running pkgutil.get_data disrupt the future import?


Solution

  • First of all, this was a great question and a great opportunity to learn something new about Python's import system. So let's dig in!

    If we look at the implementation of pkgutil.get_data we see something like this:

    def get_data(package, resource):
        spec = importlib.util.find_spec(package)
        if spec is None:
            return None
        loader = spec.loader
        if loader is None or not hasattr(loader, 'get_data'):
            return None
        # XXX needs test
        mod = (sys.modules.get(package) or
               importlib._bootstrap._load(spec))
        if mod is None or not hasattr(mod, '__file__'):
            return None
    
        # Modify the resource name to be compatible with the loader.get_data
        # signature - an os.path format "filename" starting with the dirname of
        # the package's __file__
        parts = resource.split('/')
        parts.insert(0, os.path.dirname(mod.__file__))
        resource_name = os.path.join(*parts)
        return loader.get_data(resource_name)
    

    And the answer to your question is in this part of the code:

        mod = (sys.modules.get(package) or
               importlib._bootstrap._load(spec))
    

    It first checks sys.modules for the package we're looking for (module.submodule in this example). If the package is already loaded, that module object is used. If not, get_data loads the package itself using importlib._bootstrap._load.

    So let's look at the implementation of importlib._bootstrap._load to see what's going on.

    def _load(spec):
        """Return a new module object, loaded by the spec's loader.
        The module is not added to its parent.
        If a module is already in sys.modules, that existing module gets
        clobbered.
        """
        with _ModuleLockManager(spec.name):
            return _load_unlocked(spec)
    

    Well, there it is! The docstring says "The module is not added to its parent."

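    This means submodule is loaded and placed in sys.modules, but it is never set as an attribute on its parent package, module. The later import module.submodule.attribute statement does not repair this: the import system finds module.submodule already in sys.modules and reuses it without binding it on the parent. So when we try to reach submodule through module, the link is missing, hence the AttributeError.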
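    The missing parent link can be observed directly. The following is only a sketch: it builds a throwaway package in a temporary directory (the names demo_pkg and sub, and both helper functions, are made up for illustration and are not part of the question's layout), calls pkgutil.get_data before any regular import, and then reports whether the subpackage is in sys.modules and whether it is reachable as an attribute of its parent.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create demo_pkg/sub with an attribute module and a data file.

    The layout mirrors the question; the names are hypothetical.
    """
    sub = root / "demo_pkg" / "sub"
    sub.mkdir(parents=True)
    (root / "demo_pkg" / "__init__.py").write_text("")
    (sub / "__init__.py").write_text("")
    (sub / "attribute.py").write_text("def hello():\n    return 'hello'\n")
    (sub / "data.txt").write_text("payload")


def inspect_after_get_data() -> dict:
    """Call pkgutil.get_data first, then report what the import system holds."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # No regular import of demo_pkg.sub has happened at this point.
            data = pkgutil.get_data("demo_pkg.sub", "data.txt")
            parent = sys.modules["demo_pkg"]
            return {
                "data": data,
                # Is the subpackage registered in sys.modules?
                "in_sys_modules": "demo_pkg.sub" in sys.modules,
                # Is it reachable as an attribute of its parent package?
                "bound_on_parent": hasattr(parent, "sub"),
            }
        finally:
            # Undo the side effects so the function can be called again.
            sys.path.remove(tmp)
            for name in ("demo_pkg.sub", "demo_pkg"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    for key, value in inspect_after_get_data().items():
        print(f"{key}: {value!r}")
```

    On an interpreter whose get_data matches the source quoted above, I would expect in_sys_modules to be True while bound_on_parent is False, which is exactly the state that produces the AttributeError. Other Python versions may behave differently, so treat the printed values as something to check rather than a guarantee.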

    It makes sense for get_data to use this function: it only needs to read another file in the package, so there is no need to import the whole package and bind it to its parent, its parent's parent, and so on.

    To see it yourself, I suggest using a debugger and setting some breakpoints. Then you can follow what happens step by step.
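    Since the second snippet in the question works when the regular import runs before get_data, one way to avoid the problem is to make that order explicit. Below is a minimal sketch of that idea; get_data_after_import is a hypothetical helper name, not something pkgutil provides.

```python
import importlib
import pkgutil
from typing import Optional


def get_data_after_import(package: str, resource: str) -> Optional[bytes]:
    """Hypothetical helper: import the package normally, then read its data."""
    # import_module goes through the regular import path, so every package
    # along the dotted name is bound as an attribute of its parent.
    importlib.import_module(package)
    # The package is now in sys.modules, so get_data reuses that module
    # instead of loading a fresh one that is detached from its parent.
    return pkgutil.get_data(package, resource)
```

    With the layout from the question, this would be called as get_data_after_import('module.submodule', 'data.txt'), after which import module.submodule.attribute and module.submodule.attribute.hello() should behave as in the working snippets.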