I'm quite new to decorators and classes in general in Python, and I'd like to know whether there is a better way to decorate pandas objects. As an example, I have written the following to create two accessor properties, lisa and wil:
import pandas as pd
# Build the DataFrame directly; going through np.array would coerce the scores to strings.
test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74],
], columns=['name1', 'name2', 'scoreA', 'scoreB'])
@pd.api.extensions.register_dataframe_accessor('abc')
class ABCDataFrame:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
    @property
    def lisa(self):
        return self._obj.loc[self._obj['name1'] == 'lisa']
    @property
    def wil(self):
        return self._obj.loc[self._obj['name2'] == 'wil']
Example output is as follows:
test.abc.lisa.abc.wil
  name1 name2  scoreA  scoreB
1  lisa   wil    9.67    8.87
3  lisa   wil    0.58    6.12
I have two questions.
First, in practice I am creating many more than two methods and need to call several of them on the same line. Is there a way to get test.lisa.wil to return the same output as test.abc.lisa.abc.wil above, so that I don't have to type abc each time?
Second, if there are any other suggestions/resources on decorating pandas DataFrames, please let me know.
You can do this with the pandas-flavor library, which lets you extend the DataFrame class with additional methods.
import pandas as pd
import pandas_flavor as pf
# Create test DataFrame as before.
test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74]
], columns=['name1', 'name2', 'scoreA', 'scoreB'])
# Register new methods.
@pf.register_dataframe_method
def lisa(df):
    return df.loc[df['name1'] == 'lisa']
@pf.register_dataframe_method
def wil(df):
    return df.loc[df['name2'] == 'wil']
Now these can be called directly on the DataFrame, without the intermediate .abc accessor. Note that they are methods rather than properties, so they are called with parentheses.
test.lisa()
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 2  lisa   fay    3.41    5.04
# 3  lisa   wil    0.58    6.12
test.lisa().wil()
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 3  lisa   wil    0.58    6.12
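If you would rather not add a dependency, a decorator like this can be approximated on top of the same pd.api.extensions.register_dataframe_accessor you are already using. The sketch below is only an illustration under that assumption: register_method and _CallableAccessor are hypothetical names of my own, and this is not meant to reproduce pandas-flavor's actual source. The idea is to register an accessor whose instances are callable, so that test.lisa returns the accessor and test.lisa() runs the function on the underlying DataFrame.

```python
import pandas as pd


def register_method(func):
    """Hypothetical helper: expose func(df, ...) as df.<func name>(...)."""

    class _CallableAccessor:
        def __init__(self, pandas_obj):
            self._obj = pandas_obj

        def __call__(self, *args, **kwargs):
            # Forward the call to the plain function, passing the DataFrame first.
            return func(self._obj, *args, **kwargs)

    # Register the callable accessor under the function's own name.
    pd.api.extensions.register_dataframe_accessor(func.__name__)(_CallableAccessor)
    return func


@register_method
def lisa(df):
    return df.loc[df['name1'] == 'lisa']


@register_method
def wil(df):
    return df.loc[df['name2'] == 'wil']


test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74],
], columns=['name1', 'name2', 'scoreA', 'scoreB'])

result = test.lisa().wil()
print(result)
```

Because each call returns an ordinary DataFrame, the registered names stay available on the result, which is what makes the chaining work.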
Update
Since you have many of these, it is also possible to define a generic filtering method and register it for each value in a loop.
def add_method(key, val, fn_name=None):
    def fn(df):
        return df.loc[df[key] == val]
    if fn_name is None:
        fn_name = f'{key}_{val}'
    fn.__name__ = fn_name
    fn = pf.register_dataframe_method(fn)
    return fn

for name1 in ['john', 'lisa']:
    add_method('name1', name1)
for name2 in ['fay', 'meg', 'wil']:
    add_method('name2', name2)
These then become available as methods, just as if you had defined each one directly. Note that I have prefixed each method name with the column name (name1 or name2) to be extra clear; that is optional.
test.name1_john()
#   name1 name2  scoreA  scoreB
# 0  john   meg    2.23    6.49
# 4  john   wil    7.31    1.74
test.name1_lisa()
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 2  lisa   fay    3.41    5.04
# 3  lisa   wil    0.58    6.12
test.name2_fay()
#   name1 name2  scoreA  scoreB
# 2  lisa   fay    3.41    5.04
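One detail worth spelling out is why add_method is a separate function rather than a def written directly inside the for loop. Python closures look up variables when they are called, not when they are defined, so functions created inline in a loop all end up seeing the last loop value. The pandas-free sketch below illustrates this with plain string comparisons; inline, factory and make_filter are hypothetical names used only for the demonstration.

```python
# Inline version: each lambda looks up `val` only when it is called,
# by which time the loops have finished and `val` is 'lisa'.
inline = {}
for val in ['john', 'lisa']:
    inline[val] = lambda name: name == val


# Factory version: every call to make_filter creates a new scope,
# so each returned function keeps the value it was built with.
def make_filter(val):
    def fn(name):
        return name == val
    return fn


factory = {}
for val in ['john', 'lisa']:
    factory[val] = make_filter(val)

print(inline['john']('john'))   # False: the lambda compares against 'lisa'
print(factory['john']('john'))  # True: the filter kept its own 'john'
```

add_method plays the same role as make_filter here: key and val are its parameters, so every registered fn keeps its own pair.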
Update 2
It is also possible for registered methods to have arguments. So another approach is to create one such method per column, with the value as an argument.
@pf.register_dataframe_method
def name1(df, val):
    return df.loc[df['name1'] == val]
@pf.register_dataframe_method
def name2(df, val):
    return df.loc[df['name2'] == val]
test.name1('lisa')
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 2  lisa   fay    3.41    5.04
# 3  lisa   wil    0.58    6.12
test.name1('lisa').name2('wil')
#   name1 name2  scoreA  scoreB
# 1  lisa   wil    9.67    8.87
# 3  lisa   wil    0.58    6.12
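Taking the argument idea one step further, a single method can accept several column=value pairs at once, which removes the need to chain at all. The sketch below uses only the built-in accessor API from the question, so it needs no extra library; the accessor name sel, the class SelectAccessor and the method eq are hypothetical names chosen for illustration.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor('sel')  # 'sel' is a made-up accessor name
class SelectAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def eq(self, **conditions):
        """Keep the rows where every given column equals the given value."""
        mask = pd.Series(True, index=self._obj.index)
        for column, value in conditions.items():
            # Combine the conditions with a logical AND.
            mask = mask & (self._obj[column] == value)
        return self._obj.loc[mask]


test = pd.DataFrame([
    ['john', 'meg', 2.23, 6.49],
    ['lisa', 'wil', 9.67, 8.87],
    ['lisa', 'fay', 3.41, 5.04],
    ['lisa', 'wil', 0.58, 6.12],
    ['john', 'wil', 7.31, 1.74],
], columns=['name1', 'name2', 'scoreA', 'scoreB'])

result = test.sel.eq(name1='lisa', name2='wil')
print(result)
```

The accessor prefix is typed only once per call here, however many conditions are combined. The trade-off is that keyword arguments only work for column names that are valid Python identifiers.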