Tags: python, pandas, lambda, apply, multi-index

Accessing groups in Pandas lambda function


I have a Pandas dataframe whose columns form a MultiIndex. Level 0 is 'Strain' and level 1 is 'JGI library'. Each 'Strain' has several 'JGI library' columns associated with it. I would like to use a lambda function to apply a t-test comparing two different strains. To troubleshoot, I have been taking one row of my dataframe with .iloc[0].

row = pvalDf.iloc[0]
parent = 'LL1004'
child = 'LL345'
ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1]

This works as expected. Now I try to apply it to my whole dataframe:

parent = 'LL1004'
child = 'LL345'
pvalDf = countsDf4.apply(lambda row: ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1]) 

Now I get an error message: "ValueError: ('level name Strain is not the name of the index', 'occurred at index (LL1004, BCHAC)')"

'LL1004' is a 'Strain,' but Pandas doesn't seem to be aware of this. It looks like maybe the multiindex was not passed to the lambda function correctly? Is there a better way to troubleshoot lambda functions than using .iloc[0]?

I put a copy of my Jupyter notebook and an excel file with the countsDf4 dataframe on Github https://github.com/danolson1/pandas_ttest

Thanks, Dan


Solution

  • How about, more simply:

    pvalDf = countsDf4.apply(lambda row: ttest_ind(row[parent], row[child]), axis=1)
    

    I've tested it on your notebook and it works.

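    Your problem is that DataFrame.apply() applies the function to each column by default (axis=0), not to each row. Each column is passed in as a Series indexed by the dataframe's row index, which has no 'Strain' level, hence the "level name Strain is not the name of the index" error (note that it occurred at the column label (LL1004, BCHAC)). Specify axis=1 to override the default and apply the function row by row.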

    Also, there's no reason to use row.groupby(level='Strain').get_group(x) when you could simply index the group of columns by row[x]. :)
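To make the axis behavior concrete, here is a minimal sketch on made-up data. The `counts` frame, its library names (`A1`, `B1`, ...), the gene labels, and the `row_pvalue` helper are all hypothetical stand-ins for countsDf4, not taken from the notebook. It takes the `.pvalue` attribute of the t-test result, which should match the `[1]` element used in the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical stand-in for countsDf4: the COLUMNS carry the
# (Strain, JGI library) MultiIndex; the rows are plain labels.
columns = pd.MultiIndex.from_tuples(
    [("LL1004", "A1"), ("LL1004", "A2"), ("LL1004", "A3"),
     ("LL345", "B1"), ("LL345", "B2"), ("LL345", "B3")],
    names=["Strain", "JGI library"],
)
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.normal(size=(4, 6)),
    index=["gene1", "gene2", "gene3", "gene4"],
    columns=columns,
)

# A row (what axis=1 passes in) is indexed by the column MultiIndex,
# so it has a 'Strain' level.
row_level_names = list(counts.iloc[0].index.names)

# A column (what the default axis=0 passes in) is indexed by the row
# labels, which have no 'Strain' level; this is what triggers the error.
col_level_names = list(counts.iloc[:, 0].index.names)

parent = "LL1004"
child = "LL345"

def row_pvalue(row):
    """Return the t-test p-value comparing the two strains in one row."""
    # row[parent] selects every library under that strain (level 0).
    return ttest_ind(row[parent], row[child]).pvalue

# One p-value per row of the frame.
pvals = counts.apply(row_pvalue, axis=1)

print(row_level_names)
print(col_level_names)
print(pvals)
```

Using a named function instead of a lambda also helps with troubleshooting: you can call `row_pvalue(counts.iloc[0])` directly, or temporarily print `type(row)` and `row.index` inside it to see exactly what apply() is handing over.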