python, python-3.x, pandas, python-itertools, functools

chain join multiple arguments from a list of variable size using a supplied bivariate function


I am looking to apply a function to every argument in a list (map could do that part) and then "join" the results using a second function, ideally one that can exit early (say, if the objective is to find a single instance or to reach a threshold).

Here is an example where the function is ~np.isnan, applied to a variable number of columns of a data frame, and the "join" is the bitwise & operator on the resulting boolean masks. It finds the rows that contain a NaN value in any column from a variable list of columns, and then removes those rows.

import random

import numpy as np
import pandas as pd

data_values = range(10)
column_names = ["C" + str(x) for x in data_values]
# 10x10 frame; cast to float so that NaN can be assigned without changing dtypes
data = pd.DataFrame(columns=column_names, data=np.reshape(np.repeat(data_values, 10, 0), (10, 10)).astype(float))
# set a random block of rows and columns to NaN
data.iloc[random.sample(data_values, random.randint(1, 9)), random.sample(data_values, random.randint(1, 9))] = np.nan
# pick at least one column, otherwise masks[0] below does not exist
cols_to_check = random.sample(column_names, random.randint(1, 9))
# ideally: data.loc[pd.notnull(data[cols_to_check[0]]) & pd.notnull(data[cols_to_check[1]]) & ...]
# or perhaps: data.loc[chainFunc(pd.notnull, np.logical_and, cols_to_check)]
# positions of the non-NaN rows, one list per checked column
masks = [list(np.where(~np.isnan(data[x]))[0]) for x in cols_to_check]
# keep only the rows that are non-NaN in every checked column
data.iloc[sorted(set(masks[0]).intersection(*masks))]

This becomes extremely slow on large data frames. Is it possible to generalize it using itertools and functools and drastically improve performance? Say, something like this (pseudocode):

def chainFunc(func_applied, func_chain, args):
    # start from the first argument, then fold the remaining ones in one at a time
    x = func_applied(args[0])
    for arg in args[1:]:
        x = func_chain(x, func_applied(arg))
    return x
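
The helper above always visits every argument. A minimal sketch of the early exit mentioned earlier could look like the following; the name `chain_func_early_exit` and the `stop_when` parameter are hypothetical and introduced only for illustration. It uses plain Python values so the short-circuit is easy to see.

```python
def chain_func_early_exit(func_applied, func_chain, args, stop_when=None):
    """Fold func_chain over func_applied(arg), stopping once stop_when(result) is true.

    stop_when is a hypothetical, optional predicate on the running result.
    """
    iterator = iter(args)
    try:
        result = func_applied(next(iterator))
    except StopIteration:
        raise ValueError("args must contain at least one element") from None
    for arg in iterator:
        if stop_when is not None and stop_when(result):
            break  # the remaining arguments are never evaluated
        result = func_chain(result, func_applied(arg))
    return result


# Usage: add up squares until the running total exceeds 10.
calls = []  # records which arguments were actually evaluated


def square(x):
    calls.append(x)
    return x * x


total = chain_func_early_exit(
    square,
    lambda a, b: a + b,
    [1, 2, 3, 4, 5],
    stop_when=lambda t: t > 10,
)
```

For the boolean masks in the data frame example, one plausible stop condition would be `lambda m: not m.any()`, since an all-False mask cannot become True again under `&`.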

How would it work on the data frame example above?


Solution

  • I was looking for a generic way to combine an arbitrary list of arguments and apply the result to a data frame. I guess the application in the example above is close to dropna, but not exactly the same. What I wanted was a combination of reduce and chain. There is no pandas-specific interface for this, but it is possible to get something working:

    import functools
    # AND together the not-null masks of all requested columns, then keep the matching rows
    data.iloc[np.where(functools.reduce(lambda x, y: x & y,
                                        map(lambda z: pd.notnull(data[z]),
                                            cols_to_check)))[0]]
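
The same idea can be wrapped in a reusable helper. This is only a sketch: the name `chain_func`, the small frame `df` and the column list `cols` are made up for illustration and are not the random data from the question. It also compares the result with two built-in pandas routes, `dropna(subset=...)` and `notna().all(axis=1)`, which appear to give the same rows for this particular not-null mask.

```python
import functools
import operator

import numpy as np
import pandas as pd

# Small illustrative frame (made-up values).
df = pd.DataFrame({
    "C0": [0.0, np.nan, 2.0, 3.0],
    "C1": [0.0, 1.0, np.nan, 3.0],
    "C2": [np.nan, 1.0, 2.0, 3.0],
})
cols = ["C0", "C1"]  # hypothetical subset of columns to check


def chain_func(func_applied, func_chain, args):
    """Apply func_applied to each argument lazily and fold the results with func_chain."""
    return functools.reduce(func_chain, map(func_applied, args))


# Boolean mask: True where every checked column is non-null.
mask = chain_func(lambda c: df[c].notna(), operator.and_, cols)
kept = df.loc[mask]

# Built-in alternatives for this specific case.
kept_dropna = df.dropna(subset=cols)
kept_all = df.loc[df[cols].notna().all(axis=1)]
```

Indexing with the boolean mask directly avoids building Python sets of row positions. Note that `functools.reduce` has no early exit, so every column is still evaluated.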