Support multiple data loaders in Python

I need to create a data interface that can query the same data from an Excel file, an API, or our DB.

What would be the best structure for this, and how is it normally set up so that we don't have to manually switch imports depending on whether the data comes from Excel, the API, or the DB?

Solution

You can use a Driver/Factory pattern here. Basically, you write data drivers that fetch the data from the different endpoints; in all cases the resulting data is the same.

This is a standard OOP use case, and abstractions play a vital role in such designs. What you need is a standard interface/abstraction for the known operations, and implement it across different driver implementations.

In your case, you know the data is a concrete object, and you need a loader (which is synonymous with a driver) that generates this data.

So, define the data object. For example:

class MyData:
    def __init__(self, *args, **kwargs):
        # TODO- Accept the relevant args for data object here!
        pass

You could have any add-ons here. Now, what you need is a data loader, which is an abstraction; you decide on the loader implementation at runtime. So, first, decide on an abstraction, which could be something like the one below:

from abc import ABC, abstractmethod


class AbstractDataLoader(ABC):
    # Deriving from ABC is what makes @abstractmethod enforceable.

    def __init__(self, *args, **kwargs):
        pass

    @abstractmethod
    def load(self, *args, **kwargs) -> MyData:
        pass
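A minimal sketch of how the abstraction is enforced: `@abstractmethod` only takes effect when the class derives from `abc.ABC` (or uses `ABCMeta`), in which case instantiating a class that has not implemented `load` raises `TypeError`. The `InMemoryLoader` class, the `can_instantiate` helper, and the simplified `MyData` below are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class MyData:
    """Simplified stand-in for the data object (assumed shape: a list of rows)."""

    def __init__(self, rows):
        self.rows = rows


class AbstractDataLoader(ABC):
    @abstractmethod
    def load(self) -> MyData:
        """Return the data from whatever source the subclass wraps."""


class InMemoryLoader(AbstractDataLoader):
    # Hypothetical loader, present only to show a class that fulfils the contract.
    def load(self) -> MyData:
        return MyData(rows=[{"id": 1}])


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if it raises TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

With this in place, `can_instantiate(AbstractDataLoader)` is `False` while `can_instantiate(InMemoryLoader)` is `True`, so a loader that forgets to implement `load` fails at construction time rather than later.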

The skeleton is defined. Now you need to define the various data loaders, which pick the data from different endpoints such as a DB, a file, or an API. Let's create some implementations like the ones below.

class DBDataLoader(AbstractDataLoader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # load db connections, other configs

    def load(self, *args, **kwargs) -> MyData:
        # TODO- Load data from DB
        pass


class ExcelDataLoader(AbstractDataLoader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # load excel files, other configs

    def load(self, *args, **kwargs) -> MyData:
        # TODO- Load data from Excel
        pass


class APIDataLoader(AbstractDataLoader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # load api connections, other configs

    def load(self, *args, **kwargs) -> MyData:
        # TODO- Load data from API
        pass
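To make one of these stubs concrete, here is a rough sketch of a DB-backed loader using the standard-library `sqlite3` module. Everything specific in it is an assumption for illustration: the `SQLiteDataLoader` name, the `items` table with `id` and `name` columns, the default query, and the `demo` helper that builds an in-memory database. A real loader would use your own DB driver and schema.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class MyData:
    rows: list  # assumed shape: list of (id, name) tuples


class AbstractDataLoader(ABC):
    @abstractmethod
    def load(self) -> MyData:
        ...


class SQLiteDataLoader(AbstractDataLoader):
    """Hypothetical loader; the table and column names are assumptions."""

    def __init__(self, connection: sqlite3.Connection,
                 query: str = "SELECT id, name FROM items ORDER BY id"):
        self.connection = connection
        self.query = query

    def load(self) -> MyData:
        # Run the configured query and wrap the result in the shared data object.
        cursor = self.connection.execute(self.query)
        return MyData(rows=cursor.fetchall())


def demo() -> MyData:
    # Build a throwaway in-memory database so the sketch runs anywhere.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO items (id, name) VALUES (?, ?)",
            [(1, "alpha"), (2, "beta")],
        )
        conn.commit()
        return SQLiteDataLoader(conn).load()
    finally:
        conn.close()
```

The point of the design is that callers only ever see `MyData`; whether the rows came from SQLite, Excel, or an API stays inside the loader.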

Done: the data object and the drivers are ready. Now it's a matter of configuring and using a particular driver. This can be done imperatively, using a factory approach like the one below:

class MyApp:
    def __init__(self, configured_loader):
        self.configured_loader = configured_loader

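    def _resolve_loader(self):
        if self.configured_loader == 'db':
            return DBDataLoader()
        elif self.configured_loader == 'excel':
            return ExcelDataLoader()
        # ....
        # Fail loudly instead of returning None for an unrecognised name.
        raise ValueError(f"Unknown loader: {self.configured_loader!r}")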

    def load_data(self) -> MyData:
        return self._resolve_loader().load()


if __name__ == '__main__':
    import sys

    loader = sys.argv[1]
    app = MyApp(loader)
    data = app.load_data()
    # Do with it whatever you want!
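As a variation on the if/elif chain, the name-to-class mapping can live in a dictionary, so adding a loader means adding one entry. This is only a sketch: the `LOADERS` registry, the `resolve_loader` function, and the placeholder return values are assumptions, not part of the code above.

```python
from abc import ABC, abstractmethod


class AbstractDataLoader(ABC):
    @abstractmethod
    def load(self) -> dict:
        ...


class DBDataLoader(AbstractDataLoader):
    def load(self) -> dict:
        return {"source": "db"}  # placeholder payload for the sketch


class ExcelDataLoader(AbstractDataLoader):
    def load(self) -> dict:
        return {"source": "excel"}  # placeholder payload for the sketch


# Hypothetical registry: configuration name -> loader class.
LOADERS = {
    "db": DBDataLoader,
    "excel": ExcelDataLoader,
}


def resolve_loader(name: str) -> AbstractDataLoader:
    """Instantiate the loader registered under `name`."""
    try:
        loader_class = LOADERS[name]
    except KeyError:
        known = ", ".join(sorted(LOADERS))
        raise ValueError(
            f"Unknown loader {name!r}; expected one of: {known}"
        ) from None
    return loader_class()
```

Calling `resolve_loader("excel").load()` goes through the Excel loader, and an unrecognised name raises `ValueError` listing the valid choices.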

Or, a better approach is to do it declaratively, using configuration such as an env file. For example, define an env file such as app.env with definitions like:

myapp:data-loader=loaders.APIDataLoader
myapp:data-loader:api:endpoint=https://some-server/api/v1/data
..
..
..

Then use a library like python-dotenv to make these values available at runtime, and load the data using the configured class directly.

For example:

import os
import importlib

from dotenv import load_dotenv


class MyApp:
    def __init__(self):
        # Read after load_dotenv() has run, so the value from app.env is visible.
        self.configured_loader = os.getenv("myapp:data-loader")

    def _resolve_loader(self):
        # Split "loaders.APIDataLoader" into the module path and the class name.
        package_name, class_name = self.configured_loader.rsplit('.', 1)
        module = importlib.import_module(package_name)
        driver_class = getattr(module, class_name)
        # With the app.env shown above, this creates an instance of APIDataLoader.
        return driver_class()

    def load_data(self) -> MyData:
        return self._resolve_loader().load()


if __name__ == '__main__':
    # Loads the configs from app.env..
    load_dotenv(dotenv_path='app.env')
    app = MyApp()
    data = app.load_data()
    # Do with it whatever you want!
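Because the class name now comes from configuration, it may be worth validating what gets imported before instantiating it. The sketch below is one way to do that, under stated assumptions: `resolve_class` is a hypothetical helper, and the usage note resolves a standard-library class purely so the example runs without a `loaders` module.

```python
import importlib


def resolve_class(dotted_path: str, expected_base: type) -> type:
    """Import 'package.module.ClassName' and check it subclasses expected_base."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(
            f"Expected 'package.module.ClassName', got {dotted_path!r}"
        )

    module = importlib.import_module(module_name)
    try:
        candidate = getattr(module, class_name)
    except AttributeError:
        raise ValueError(
            f"Module {module_name!r} has no attribute {class_name!r}"
        ) from None

    # Reject anything that is not a class deriving from the expected base.
    if not (isinstance(candidate, type) and issubclass(candidate, expected_base)):
        raise TypeError(
            f"{dotted_path!r} is not a subclass of {expected_base.__name__}"
        )
    return candidate
```

For instance, `resolve_class("collections.OrderedDict", dict)` returns the `OrderedDict` class, while passing `list` as the expected base raises `TypeError`. In the app, the expected base would be `AbstractDataLoader`, so a typo in app.env fails with a clear message instead of an obscure error later on.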

This summarizes a simple but extensible approach to your problem.