Supporting multiple data loaders in Python


I need to create a data interface that can query the same data from an Excel file, an API, or our DB.

What would be the best structure for this, and how is it normally set up to avoid manually switching imports depending on whether we need data from Excel, the API, or the DB?


Solution

  • You can use a Driver/Factory pattern here. Basically, you write data drivers that fetch the data from the different endpoints. In all cases, the data they return is the same.

    This is a standard OOP use case, and abstractions play a vital role in such designs. What you need is a standard interface (abstraction) for the known operations, implemented by each of the different drivers.

    In your case, you know the data is a concrete object, and you need a loader (which is synonymous with a driver) that generates this data.

    So, define the data object. For example:

    class MyData:
        def __init__(self, *args, **kwargs):
            # TODO- Accept the relevant args for data object here!
            pass
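
As a rough illustration of what such a data object might look like, here is a minimal sketch using a dataclass. The fields `rows` and `source` are hypothetical, chosen only for the example; use whatever your data actually needs.

```python
from dataclasses import dataclass, field


@dataclass
class MyData:
    # Hypothetical fields, for illustration only
    rows: list = field(default_factory=list)  # the records themselves
    source: str = "unknown"                   # which loader produced them

    def __len__(self) -> int:
        return len(self.rows)


data = MyData(rows=[{"id": 1}, {"id": 2}], source="excel")
print(len(data), data.source)
```

Whichever loader produces it, the rest of the application only ever sees this one type.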
    

    You can extend this class however you need. Next, you need a data loader, defined as an abstraction; the concrete loader implementation is chosen at run time. So, first, decide on an abstraction, which could look something like this:

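    from abc import ABC, abstractmethod


    class AbstractDataLoader(ABC):
        """Common interface that every data loader implements."""

        def __init__(self, *args, **kwargs):
            pass

        @abstractmethod
        def load(self, *args, **kwargs) -> MyData:
            """Fetch the data from the underlying source and return it as MyData."""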
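
One detail worth checking: `@abstractmethod` is only enforced when the class derives from `abc.ABC` (or uses `ABCMeta` as its metaclass). The sketch below, with the hypothetical class names `Loader`, `IncompleteLoader` and `CompleteLoader`, shows the effect: a subclass that forgets to implement `load` cannot be instantiated.

```python
from abc import ABC, abstractmethod


class Loader(ABC):
    @abstractmethod
    def load(self):
        """Return the loaded data."""


class IncompleteLoader(Loader):
    pass  # forgot to implement load()


class CompleteLoader(Loader):
    def load(self):
        return ["row"]


try:
    IncompleteLoader()
    instantiated = True
except TypeError:
    # A class with unimplemented abstract methods cannot be instantiated
    instantiated = False

print(instantiated)
print(CompleteLoader().load())
```

This turns a missing `load` implementation into an error at construction time rather than a silent `None` later on.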
    

    With the skeleton defined, you now write the individual data loaders, each of which picks up the data from a different endpoint such as a DB, a file, or an API. Some implementations could look like this:

    class DBDataLoader(AbstractDataLoader):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # load db connections, other configs
    
        def load(self, *args, **kwargs) -> MyData:
            # TODO- Load data from DB
            pass
    
    
    class ExcelDataLoader(AbstractDataLoader):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # load excel files, other configs
    
        def load(self, *args, **kwargs) -> MyData:
            # TODO- Load data from Excel
            pass
    
    
    class APIDataLoader(AbstractDataLoader):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # load api connections, other configs
    
        def load(self, *args, **kwargs) -> MyData:
            # TODO- Load data from API
            pass
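
To make one of these concrete, here is a self-contained sketch of a DB loader backed by the standard library's `sqlite3`. The class name `SqliteDataLoader`, the table `items` and its columns, and the simplified `MyData` and `AbstractDataLoader` stand-ins are all assumptions made so the example runs on its own; a real loader would use your actual DB driver and schema.

```python
import sqlite3
from abc import ABC, abstractmethod


class MyData:
    def __init__(self, rows):
        self.rows = rows


class AbstractDataLoader(ABC):
    @abstractmethod
    def load(self) -> MyData:
        ...


class SqliteDataLoader(AbstractDataLoader):
    def __init__(self, connection: sqlite3.Connection):
        # The connection is injected so that tests can pass an in-memory DB
        self.connection = connection

    def load(self) -> MyData:
        cursor = self.connection.execute("SELECT id, name FROM items ORDER BY id")
        return MyData(rows=cursor.fetchall())


# In-memory database with a hypothetical "items" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", [(1, "a"), (2, "b")])

data = SqliteDataLoader(conn).load()
print(data.rows)
```

The Excel and API loaders would follow the same shape: take their configuration in `__init__`, and return a `MyData` from `load`.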
    

    That's it: the data object and the drivers are ready. What remains is configuring and using a particular driver. This can be done imperatively, using a factory approach like the one below:

    class MyApp:
        def __init__(self, configured_loader):
            self.configured_loader = configured_loader
    
        def _resolve_loader(self):
            if self.configured_loader == 'db':
                return DBDataLoader()
            elif self.configured_loader == 'excel':
                return ExcelDataLoader()
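            # ....
            # Fail loudly on an unknown name instead of silently returning None
            raise ValueError(f'Unknown data loader: {self.configured_loader!r}')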
    
        def load_data(self) -> MyData:
            return self._resolve_loader().load()
    
    
    if __name__ == '__main__':
        import sys
    
        loader = sys.argv[1]
        app = MyApp(loader)
        data = app.load_data()
        # Do with it whatever you want!
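
If the `if`/`elif` chain grows, one common variation is a dictionary that maps names to loader classes. This is a sketch rather than part of the approach above; the stub loaders, the `LOADERS` registry, and the `resolve_loader` helper are names assumed for illustration.

```python
class DBDataLoader:
    def load(self):
        return "data from db"


class ExcelDataLoader:
    def load(self):
        return "data from excel"


# Hypothetical registry: configuration name -> loader class
LOADERS = {
    "db": DBDataLoader,
    "excel": ExcelDataLoader,
}


def resolve_loader(name):
    try:
        loader_class = LOADERS[name]
    except KeyError:
        raise ValueError(f"Unknown data loader: {name!r}") from None
    return loader_class()


print(resolve_loader("excel").load())
```

Adding a new source then means adding one entry to the registry, with no change to the lookup logic.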
    

    Or, as a better approach, do it declaratively, using configuration such as an env file. For example, define an env file like app.env with definitions such as:

    myapp:data-loader=loaders.APIDataLoader
    myapp:data-loader:api:endpoint=https://some-server/api/v1/data
    ..
    ..
    ..
    

    Then use a library like python-dotenv to make these settings available at run time, and load the data using the configured class directly.

    For example:

    import os
    import importlib
    from dotenv import load_dotenv


    class MyApp:
        def __init__(self):
            self.configured_loader = os.getenv("myapp:data-loader")

        def _resolve_loader(self):
            # Split "loaders.APIDataLoader" into the module path and the class name
            package_name, class_name = self.configured_loader.rsplit('.', 1)
            module = importlib.import_module(package_name)
            driver_class = getattr(module, class_name)
            # As of now, this creates an instance of APIDataLoader
            return driver_class()
    
        def load_data(self) -> MyData:
            return self._resolve_loader().load()
    
    
    if __name__ == '__main__':
        # Loads the configs from app.env..
        load_dotenv(dotenv_path='app.env')
        app = MyApp()
        data = app.load_data()
        # Do with it whatever you want!
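
Because the class name now comes from configuration, it can help to validate what gets imported. The sketch below resolves a dotted path in the same way, then checks that the result is a subclass of an expected base class. The helper name `resolve_class` is hypothetical, and the demo uses standard-library classes as stand-ins for the loaders so that it runs without a `loaders` module.

```python
import importlib


def resolve_class(dotted_path: str, base: type) -> type:
    """Import 'package.module.ClassName' and check that it subclasses base."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    if not (isinstance(cls, type) and issubclass(cls, base)):
        raise TypeError(f"{dotted_path} is not a subclass of {base.__name__}")
    return cls


# Stand-in: OrderedDict plays the role of a loader, dict the role of the base class
cls = resolve_class("collections.OrderedDict", dict)
print(cls.__name__)
```

In the application, the base class passed in would be `AbstractDataLoader`, so a typo or an unrelated class in the env file fails with a clear error before any data is loaded.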
    

    This summarizes a simple but extensible approach to your problem.