Tags: python, pandas, kolmogorov-smirnov

MultiIndex DataFrame: compare one row against the other rows in its group


I have a dataframe with lists as values.

import pandas as pd

index = pd.MultiIndex.from_product([["file1", "file2", "file3"], ["a", "b"]])
index.names = ['file', 'name']
data = [
    [[1,1],[0,0]],
    [[],[]],
    [[2,2,2],[7]],
    [[],[]],
    [[1],[4, 4]],    
    [[],[]],
]
df = pd.DataFrame(data, index=index, columns=['col1', 'col2'])
df

                col1    col2
file    name        
file1   a      [1, 1]   [0, 0]
        b       []      []
file2   a    [2, 2, 2]  [7]
        b       []      []
file3   a       [1]     [4, 4]
        b       []      []

I want to group by name and, within each group, run a two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp) between each row and the concatenation of the other rows. Example for name a: {file1,a} == [1,1]. The concatenation of the others is {file2,a} + {file3,a} == [2,2,2] + [1] == [2,2,2,1]. The KS statistic between them is stats.ks_2samp([1,1], [2,2,2,1]).statistic == 0.75. How can I get the expected result below (computed by hand)?

               col1     col2
file    name        
file1   a       0.75    1.0
        b       NaN     NaN
file2   a       1.0     1.0
        b       NaN     NaN
file3   a       0.6     0.66
        b       NaN     NaN
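
As a sanity check on the hand computation, here is a minimal pure-Python sketch of the two-sample KS statistic, which is the largest gap between the two empirical CDFs. `ks_statistic` is a hypothetical helper written only for illustration; it is not part of scipy and it returns the statistic alone, with no p-value.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )


target = [1, 1]            # {file1, a}
others = [2, 2, 2] + [1]   # {file2, a} + {file3, a}
d = ks_statistic(target, others)
print(d)  # 0.75
```

At the value 1, all of `target` is at or below it (ECDF 1.0) but only one of the four `others` is (ECDF 0.25), which gives the 0.75 gap.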

Sorry if this question is too ad hoc.

Below is my attempt. I couldn't figure out how to elegantly exclude the target row from the other rows.

import numpy as np
from scipy import stats

df.groupby(['name']).apply(
    lambda per_name_df: per_name_df.apply(
        lambda per_column: per_column.apply(
            lambda cell: stats.ks_2samp(cell, np.concatenate(per_column.to_numpy())) if cell else cell)))


Solution

  • ... test between a single row and a concatenation of others rows

    As you didn't specify which rows in particular, here is an example that tests the first row of each group against all remaining rows of that group:

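    from scipy.stats import ks_2samp

    def ks(a, b):
        # flatten the list of lists into one pooled sample
        b = [el for li in b for el in li]
        if a and b:
            return ks_2samp(a, b)[0]  # [0] is the KS statistic
        return float('nan')  # ks_2samp needs two non-empty samples

    # positional access via .iloc: first row vs. the remaining rows of each group
    df.groupby(df.index.get_level_values('name')).col1.apply(lambda x: ks(x.iloc[0], x.iloc[1:].to_list()))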
    

    Result:

    name
    a    0.75
    b     NaN
    Name: col1, dtype: float64
    


    Update for edited question:

    ... test between each row and a concatenation of others rows

    def ks_all(a):
        # a is one column of one group; compare each cell with the pooled other cells
        a = a.to_list()
        return [ks(a[i], a[:i] + a[i+1:]) for i in range(len(a))]
    
    df.groupby(df.index.get_level_values('name')).transform(ks_all)
    

    Result:

                col1      col2
    file  name                
    file1 a     0.75  1.000000
          b      NaN       NaN
    file2 a     1.00  1.000000
          b      NaN       NaN
    file3 a     0.60  0.666667
          b      NaN       NaN
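
For readers who prefer not to rely on `transform` accepting a function that returns a list, the same leave-one-out comparison can be sketched with an explicit loop over the groups. This is only one possible formulation, not the method used above; `leave_one_out_ks` is a hypothetical helper name, and the frame is rebuilt from the question so the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

index = pd.MultiIndex.from_product(
    [["file1", "file2", "file3"], ["a", "b"]], names=["file", "name"]
)
data = [
    [[1, 1], [0, 0]],
    [[], []],
    [[2, 2, 2], [7]],
    [[], []],
    [[1], [4, 4]],
    [[], []],
]
df = pd.DataFrame(data, index=index, columns=["col1", "col2"])


def leave_one_out_ks(cells):
    """For each list in cells, KS statistic against all the other lists pooled."""
    out = []
    for i, sample in enumerate(cells):
        rest = [v for j, other in enumerate(cells) if j != i for v in other]
        # ks_2samp needs two non-empty samples; report NaN otherwise.
        out.append(ks_2samp(sample, rest).statistic if sample and rest else np.nan)
    return out


pieces = []
for _, group in df.groupby(level="name"):
    # One small frame per group, keeping that group's (file, name) index.
    stats_by_col = {col: leave_one_out_ks(group[col].tolist()) for col in df.columns}
    pieces.append(pd.DataFrame(stats_by_col, index=group.index))

# Restore the original row order (groupby iterates name by name).
result = pd.concat(pieces).reindex(df.index)
print(result)
```

Building one small frame per group and concatenating keeps the row labels explicit, at the cost of a few more lines than the `transform` version.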