Python (pandas): Using decorators with the pandas API


I'm quite new to decorators and classes in Python, and I'd like to know whether there is a better way to decorate pandas objects. As an example, I have written the following to create two methods (implemented as properties on a custom accessor), lisa and wil:

import numpy as np
import pandas as pd

test = np.array([['john', 'meg', 2.23, 6.49],
       ['lisa', 'wil', 9.67, 8.87],
       ['lisa', 'fay', 3.41, 5.04],
       ['lisa', 'wil', 0.58, 6.12],
       ['john', 'wil', 7.31, 1.74]],
)
test = pd.DataFrame(test)
test.columns = ['name1','name2','scoreA','scoreB']

@pd.api.extensions.register_dataframe_accessor('abc')
class ABCDataFrame:

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def lisa(self):
        return self._obj.loc[self._obj['name1'] == 'lisa']
    @property
    def wil(self):
        return self._obj.loc[self._obj['name2'] == 'wil']

Example output is as follows:

test.abc.lisa.abc.wil
  name1 name2 scoreA scoreB
1  lisa   wil   9.67   8.87
3  lisa   wil   0.58   6.12

I have two questions.

First, in practice, I am creating many more than two methods and need to call several of them on the same line. Is there a way to get test.lisa.wil to return the same output as test.abc.lisa.abc.wil above, so that I don't have to type abc each time?

Second, if there are any other suggestions/resources on decorating pandas DataFrames, please let me know.


Solution

    You can do this with the pandas-flavor library, which lets you extend the DataFrame class with additional methods.

    import pandas as pd
    import pandas_flavor as pf
    
    # Create test DataFrame as before.
    test = pd.DataFrame([
        ['john', 'meg', 2.23, 6.49],
        ['lisa', 'wil', 9.67, 8.87],
        ['lisa', 'fay', 3.41, 5.04],
        ['lisa', 'wil', 0.58, 6.12],
        ['john', 'wil', 7.31, 1.74]
    ], columns=['name1', 'name2', 'scoreA', 'scoreB'])
    
    # Register new methods.
    @pf.register_dataframe_method
    def lisa(df):
        return df.loc[df['name1'] == 'lisa']
    
    @pf.register_dataframe_method
    def wil(df):
        return df.loc[df['name2'] == 'wil']
    

    These can now be called as regular DataFrame methods, without the intermediate .abc accessor. Note that they are methods rather than properties, so the parentheses are required.

    test.lisa()
    #   name1 name2  scoreA  scoreB
    # 1  lisa   wil    9.67    8.87
    # 2  lisa   fay    3.41    5.04
    # 3  lisa   wil    0.58    6.12
    
    test.lisa().wil()
    #   name1 name2  scoreA  scoreB
    # 1  lisa   wil    9.67    8.87
    # 3  lisa   wil    0.58    6.12
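
pandas-flavor is a third-party dependency. If you would rather stay with pandas alone, a similar effect can be sketched on top of the same register_dataframe_accessor API used in the question, by registering an accessor whose instances are callable. The helper name register_method below is hypothetical (it is not part of pandas or pandas-flavor), and this is only a minimal sketch of the idea, not pandas-flavor's actual implementation.

```python
import pandas as pd


def register_method(func):
    """Hypothetical helper: expose func(df, ...) as df.<func name>(...)."""

    # Register an accessor under the function's own name. Accessing
    # df.<name> builds the accessor; calling it forwards to func.
    @pd.api.extensions.register_dataframe_accessor(func.__name__)
    class _MethodAccessor:
        def __init__(self, pandas_obj):
            self._obj = pandas_obj

        def __call__(self, *args, **kwargs):
            return func(self._obj, *args, **kwargs)

    return func


test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])


@register_method
def lisa(df):
    return df.loc[df['name1'] == 'lisa']


@register_method
def wil(df):
    return df.loc[df['name2'] == 'wil']


# Chaining works because every DataFrame, including a filtered result,
# picks up the registered accessors.
chained = test.lisa().wil()
print(chained)
```

As with pandas-flavor, the calls need parentheses, since the accessor object is what gets called.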
    

    Update

    Since you have many of these, you can also define a generic factory function that builds and registers a filter method, then call it in a loop for each column.

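    def add_method(key, val, fn_name=None):
        # Build a filter that keeps rows where column `key` equals `val`.
        def fn(df):
            return df.loc[df[key] == val]
    
        # Default method name: '<column>_<value>', e.g. 'name1_lisa'.
        if fn_name is None:
            fn_name = f'{key}_{val}'
    
        # The method is registered under the function's __name__.
        fn.__name__ = fn_name
        fn = pf.register_dataframe_method(fn)
        return fn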
    
    for name1 in ['john', 'lisa']:
        add_method('name1', name1)
    
    for name2 in ['fay', 'meg', 'wil']:
        add_method('name2', name2)
    

    These then become available as methods, just as if you had defined each one directly. Note that I have prefixed each method name with the column name (name1 or name2) to be extra clear; this is optional, and you can pass fn_name to choose a different name.

    test.name1_john()
    #   name1 name2  scoreA  scoreB
    # 0  john   meg    2.23    6.49
    # 4  john   wil    7.31    1.74
    
    test.name1_lisa()
    #   name1 name2  scoreA  scoreB
    # 1  lisa   wil    9.67    8.87
    # 2  lisa   fay    3.41    5.04
    # 3  lisa   wil    0.58    6.12
    
    test.name2_fay()
    #   name1 name2  scoreA  scoreB
    # 2  lisa   fay    3.41    5.04
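
If the list of values is long, the hard-coded lists in the loops could instead be derived from the data. The sketch below is one way to do that, under a few assumptions: make_filter is a hypothetical helper, values that would not form a valid Python identifier are skipped, and a plain setattr on pd.DataFrame is used only as a stand-in so that the example runs with pandas alone. With pandas-flavor installed, you would call pf.register_dataframe_method(fn) at that point instead, as in add_method.

```python
import pandas as pd

test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])


def make_filter(column, value):
    # Hypothetical factory: binding column and value as arguments gives
    # each generated function its own values, which a function defined
    # directly inside the loop would not.
    def fn(df):
        return df.loc[df[column] == value]

    fn.__name__ = f'{column}_{value}'
    return fn


registered = []
for column in ['name1', 'name2']:
    # unique() returns the distinct values in order of first appearance.
    for value in test[column].unique():
        fn = make_filter(column, value)
        if not fn.__name__.isidentifier():
            continue  # e.g. values containing spaces or punctuation
        # Stand-in for pf.register_dataframe_method(fn).
        setattr(pd.DataFrame, fn.__name__, fn)
        registered.append(fn.__name__)

print(registered)
print(test.name1_lisa().name2_wil())
```

Generating methods from the data keeps the code short, but the set of available methods then depends on whichever DataFrame was used at registration time.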
    

    Update 2

    Registered methods can also take arguments, so another approach is to create one method per column and pass the value as an argument.

    @pf.register_dataframe_method
    def name1(df, val):
        return df.loc[df['name1'] == val]
    
    @pf.register_dataframe_method
    def name2(df, val):
        return df.loc[df['name2'] == val]
    
    test.name1('lisa')
    #   name1 name2  scoreA  scoreB
    # 1  lisa   wil    9.67    8.87
    # 2  lisa   fay    3.41    5.04
    # 3  lisa   wil    0.58    6.12
    
    test.name1('lisa').name2('wil')
    #   name1 name2  scoreA  scoreB
    # 1  lisa   wil    9.67    8.87
    # 3  lisa   wil    0.58    6.12
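
Taking the argument idea one step further, a single function that accepts column=value pairs covers every column at once. The sketch below uses a hypothetical function named where_equal and calls it through the built-in DataFrame.pipe, so nothing needs to be registered; it could equally be registered with pf.register_dataframe_method. For comparison, the same filter is also written with the built-in DataFrame.query.

```python
import pandas as pd

test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])


def where_equal(df, **conditions):
    """Hypothetical helper: keep rows matching every column=value pair."""
    # Start with all rows selected, then narrow down one condition at a time.
    mask = pd.Series(True, index=df.index)
    for column, value in conditions.items():
        mask &= df[column] == value
    return df.loc[mask]


# pipe passes the DataFrame as the first argument of where_equal.
result = test.pipe(where_equal, name1='lisa', name2='wil')

# The same filter expressed with the built-in query method.
same = test.query("name1 == 'lisa' and name2 == 'wil'")

print(result)
```

This trades the short test.lisa().wil() spelling for a single function that does not need to change when new names appear in the data.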