Search code examples
pandasgroup-byiterationfilteringhierarchical

How to write a function that iterates over groups of pandas dataframe and applies hierarchical conditions?


I need to write a function to filter a dataset based on some hierarchical conditions. The purpose of this function is to get one annotation for each protein among a bunch of them.

The function needs to do the following,

  1. Group the proteins by "id" column
  2. Check each analysis in the protein group
  3. Pick the analysis in a prioritized way (It's not a sorting issue so the analysis should be given in a conditional order so that it could be re-ordered later on). First, check analysis_1 if it doesn't exist then take analysis_2 or analysis_3, etc.
  4. Get the row into a new data frame.

Here is an example of data that will be filtered,

df=pd.DataFrame({
'id':      ['Protein_1', 'Protein_1', 'Protein_1', 
          'Protein_2','Protein_2','Protein_2'],
'analysis': ['analysis_6', 'analysis_4', 'analysis_1', 
                 'analysis_3','analysis_2','analysis_5'],
'annotation':['annotation_1', 'annotation_2', 'annotation_3',
              'annotation_1','annotation_2','annotation_3'] })

and this is the output I'd like to see,

df_filtered= pd.DataFrame({
'id':      ['Protein_1','Protein_2'],
'analysis': ['analysis_1', 'analysis_2'],
'annotation':['annotation_3', 'annotation_2'] })

The code in the following is working but I'd like to do it by using pandas groupby, apply, and iterrows functions.

new_df =pd.DataFrame(columns=df.columns)
protein_id=list(df.id.unique())

for protein in protein_id:

data=df[df["id"] == protein]

if len(data[data["analysis"] =="analysis_1"]) == 0:
    
    if len(data[data["analysis"] =="analysis_2"]) == 0:
        
        if len(data[data["analysis"] =="analysis_3"]) == 0:
            pass
        else:
            data2=data[data["analysis"] =="analysis_3"]
            new_df = pd.concat([new_df,data2])
        
    else:
        data2=data[data["analysis"] =="analysis_2"]
        new_df = pd.concat([new_df,data2])
    
else:
    data2=data[data["analysis"] =="analysis_1"]
    new_df = pd.concat([new_df,data2])
    
new_df

Appreciate any help!!


Solution

  • You could temporarily sort the dataframe, then drop all but one entries for each id. It looks like this:

    df.sort_values('analysis').drop_duplicates(['id'], keep='first')
    

    Note, that this doesn't change the order in your original dataframe. The result looks like this:

              id    analysis    annotation
    2  Protein_1  analysis_1  annotation_3
    4  Protein_2  analysis_2  annotation_2
    

    In case you have a function that returns the priority of an analysis, you can use it in combination with the method above:

    def prio_function(analysis):
        # return a low number for a better result
        # and a high number for a worse result
        return int(analysis.split('_')[1])  # replace this row by your code
    
    df_work= df.assign(_prio=df['analysis'].apply(prio_function))
    df_work.sort_values('_prio').drop_duplicates(['id'], keep='first').drop(columns='_prio')
    

    If priorization is simpler, you can also pass a dicitionary to apply instead of a function.