python pandas python-decorators python-class

Python (pandas): Using decorators with the pandas API

I'm quite new to decorators and classes in Python, and I'd like to know whether there is a better way to decorate pandas objects. As an example, I have written the following to create two methods (implemented as properties), lisa and wil:

import numpy as np
import pandas as pd

test = np.array([['john', 'meg', 2.23, 6.49],
       ['lisa', 'wil', 9.67, 8.87],
       ['lisa', 'fay', 3.41, 5.04],
       ['lisa', 'wil', 0.58, 6.12],
       ['john', 'wil', 7.31, 1.74]],
)
test = pd.DataFrame(test)
test.columns = ['name1','name2','scoreA','scoreB']

@pd.api.extensions.register_dataframe_accessor('abc')
class ABCDataFrame:

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def lisa(self):
        return self._obj.loc[self._obj['name1'] == 'lisa']
    @property
    def wil(self):
        return self._obj.loc[self._obj['name2'] == 'wil']

Example output is as follows:

test.abc.lisa.abc.wil
  name1 name2 scoreA scoreB
1  lisa   wil   9.67   8.87
3  lisa   wil   0.58   6.12

I have two questions.

First, in practice I am creating many more than two methods and need to call several of them on the same line. Is there a way to get test.lisa.wil to return the same output as test.abc.lisa.abc.wil above, so that I don't have to type abc each time?

Second, if there are any other suggestions or resources on decorating pandas DataFrames, please let me know.

Solution

You can do this with the pandas-flavor library, which allows you to extend the DataFrame class with additional methods.

import pandas as pd
import pandas_flavor as pf

# Create test DataFrame as before.
test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])

# Register new methods.
@pf.register_dataframe_method
def lisa(df):
    return df.loc[df['name1'] == 'lisa']

@pf.register_dataframe_method
def wil(df):
    return df.loc[df['name2'] == 'wil']

Now these can be called directly on the DataFrame, without the intermediate .abc accessor. Note that they are methods rather than properties, so they are called with parentheses.

test.lisa()
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 2  lisa   fay    3.41    5.04
# 3  lisa   wil    0.58    6.12

test.lisa().wil()
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 3  lisa   wil    0.58    6.12
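
If you would rather not add a dependency, a similar effect can be sketched with pandas's own register_dataframe_accessor by registering an accessor object that is itself callable. This is only a minimal illustration of the idea, not necessarily how pandas-flavor is implemented; register_method, only_lisa, only_wil and demo are hypothetical names chosen for this example.

```python
import pandas as pd


def register_method(func):
    """Expose func(df, ...) as df.<func name>(...) (hypothetical helper)."""

    class _CallableAccessor:
        def __init__(self, pandas_obj):
            # pandas passes the DataFrame the accessor was reached from.
            self._obj = pandas_obj

        def __call__(self, *args, **kwargs):
            # Calling the accessor forwards to the wrapped function.
            return func(self._obj, *args, **kwargs)

    # Register the callable accessor under the function's own name.
    pd.api.extensions.register_dataframe_accessor(func.__name__)(_CallableAccessor)
    return func


@register_method
def only_lisa(df):
    return df.loc[df['name1'] == 'lisa']


@register_method
def only_wil(df):
    return df.loc[df['name2'] == 'wil']


demo = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])

# Chained calls, with no intermediate accessor name to type.
result = demo.only_lisa().only_wil()
```

Because each filter returns an ordinary DataFrame, the registered names are available again on the result, which is what makes the chaining work.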

Update

Since you have many of these, you can also define a generic function that builds and registers a filtering method, and then call it in a loop for each value.

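def add_method(key, val, fn_name=None):
    # Build a filter that keeps rows where column `key` equals `val`.
    def fn(df):
        return df.loc[df[key] == val]

    # Default method name: <column>_<value>, e.g. name1_lisa.
    if fn_name is None:
        fn_name = f'{key}_{val}'

    # The method is registered under the function's __name__.
    fn.__name__ = fn_name
    fn = pf.register_dataframe_method(fn)
    return fn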

for name1 in ['john', 'lisa']:
    add_method('name1', name1)

for name2 in ['fay', 'meg', 'wil']:
    add_method('name2', name2)

These then become available as methods, just as if you had defined each one directly. Note that I have prefixed each method name with the column name (name1 or name2) to be extra clear. That is optional: pass fn_name to choose a different name.

test.name1_john()
#   name1 name2  scoreA  scoreB
# 0  john   meg    2.23    6.49
# 4  john   wil    7.31    1.74

test.name1_lisa()
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 2  lisa   fay    3.41    5.04
# 3  lisa   wil    0.58    6.12

test.name2_fay()
#   name1 name2  scoreA  scoreB
# 2  lisa   fay    3.41    5.04
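
A side note on why it helps to build each filter inside a function such as add_method rather than directly in the loop body: Python closures look up loop variables when they are called, not when they are defined, so filters created inline in a loop would all end up using the last value. The sketch below illustrates this with plain pandas; make_filter, broken, fixed and demo are hypothetical names used only for this example.

```python
import pandas as pd

demo = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])

# Late binding: both lambdas read `name` when called, after the loop ended,
# so both of them filter on the last value, 'lisa'.
broken = {}
for name in ['john', 'lisa']:
    broken[name] = lambda df: df.loc[df['name1'] == name]


def make_filter(key, val):
    """Return a filter with key and val fixed at creation time."""
    def fn(df):
        return df.loc[df[key] == val]
    return fn


# Each call to make_filter gets its own key/val, so the filters stay distinct.
fixed = {n: make_filter('name1', n) for n in ['john', 'lisa']}

john_broken = broken['john'](demo)  # actually contains the 'lisa' rows
john_fixed = fixed['john'](demo)    # contains the 'john' rows
```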

Update 2

It is also possible for registered methods to have arguments. So another approach is to create one such method per column, with the value as an argument.

@pf.register_dataframe_method
def name1(df, val):
    return df.loc[df['name1'] == val]

@pf.register_dataframe_method
def name2(df, val):
    return df.loc[df['name2'] == val]

test.name1('lisa')
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 2  lisa   fay    3.41    5.04
# 3  lisa   wil    0.58    6.12

test.name1('lisa').name2('wil')
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 3  lisa   wil    0.58    6.12
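
Taking this one step further, a single function could accept any number of column/value pairs as keyword arguments. The sketch below uses plain pandas and DataFrame.pipe, so it runs without pandas-flavor; where_eq and demo are hypothetical names for this example. Presumably the same function could also be registered with pf.register_dataframe_method in the same way as name1 and name2 above.

```python
import pandas as pd


def where_eq(df, **conditions):
    """Keep rows where every given column equals the given value."""
    mask = pd.Series(True, index=df.index)
    for column, value in conditions.items():
        # Combine the conditions with a logical AND.
        mask &= df[column] == value
    return df.loc[mask]


demo = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])

# pipe passes the DataFrame as the first argument of where_eq.
both = demo.pipe(where_eq, name1='lisa', name2='wil')
just_john = demo.pipe(where_eq, name1='john')
```

The trade-off is that one generic filter is less discoverable than named methods, but it avoids registering a new method for every column or value.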