Search code examples
pythonpandasdataframerowgroup

How to create function that removes certain pandas dataframe rows per id


I have a pandas dataframe that looks something like this:

id col1 col2 value1 value2 value3
1 123456 1234ABC 1 2 nan
1 123456 1234567 1 2 nan
1 124567 1234568 1 2 nan
1 124567 2345678 nan 2 nan
2 123456 1234564 nan 2 nan
2 123456 2132534 nan 2 nan
2 543210 10580701 nan 2 nan

I want to make a function that runs through the whole set and cleans it with these conditions:

For every unique id, do the following steps:

  • If col 1 has 6 digit code and col 2 has number and letter combination:
    • Then keep row
  • If col 1 has 6 digit code and col 2 has something else than number and letter combination
    • Then keep the first row with same 6 digit code in col1.

So in this table example, after running the function, these rows would still be in the dataset:

id col1 col2 value1 value2 value3
1 123456 1234ABC 1 2 nan
1 123456 1234567 1 2 nan
1 124567 2345678 nan 2 nan
2 123456 1234564 nan 2 nan
2 543210 10580701 nan 2 nan

At first i tried something like this:

def process_df(df):
    
    # Sort the dataframe by column 1 and column 2
    df = df.sort_values(by=['col1', 'col2'])
    
    # Create a new column that indicates whether a row has a letter in column 2
    df['has_letter'] = df['col2'].str.contains('[a-zA-Z]')
    
    # Group the dataframe by column 1 and apply the following function to each group
    def group_func(group):
        # If there are any rows with a letter in column 2, keep all of them
        if group['has_letter'].any():
            return group
        # If there are no rows with a letter in column 2, keep the first row
        else:
            return group.iloc[0:1]
    df = df.groupby('col1').apply(group_func)
    
    # Drop the has_letter column
    df = df.drop(columns=['has_letter'])
    df=df.reset_index(drop=True)
    
    return df

But it didn't work since every unique id might have rows where col1 6 digit code is same than some other ids col1 6 digit code

So somehow I have to make a function that does this to every id separately so it would work.

EDIT:

I edited the

df = df.groupby('col1').apply(group_func)

row to

df = df.groupby(['id', 'col1']).apply(group_func)

This seems? to do the job.


Solution

  • First I grouped by id, then I extracted the rows that had letters in col2, Iterating on the groups I further grouped by col1 and from these I extracted the first occurrence of the col1 value. I hope this can help you.

    # Sort the dataframe by column 1 and column 2
    df4 = df4.sort_values(by=['col1', 'col2'])
    
    # Create a new column that indicates whether a row has a letter in column 2
    df4['has_letter'] = df4['col2'].str.contains('[a-zA-Z]')
    # Fillna 
    df4['has_letter'] = df4['has_letter'].fillna(False)
    grouped_id = df4.groupby('id')
    output_df = pd.DataFrame()
    # Iterate over group_id
    for name, group_id in grouped_id:
        output_df = output_df.append(group_id[group_id['has_letter']])
        no_col2_letter = group_id[group_id['has_letter']==False]
        grouped_col2 = no_col2_letter.groupby('col1')
        for name, group_col2 in grouped_col2:
            output_df = output_df.append(group_col2[:1])
    output_df