Search code examples
pythonpandassplit-apply-combine

Filtering pandas groupby using value of column (string datatype)


I've been working on a large genomics data set that contains multiple reads of every sample to make sure we got the data, but when analyzing it we need to drop it down to one row so we don't skew the data (count the gene as present 6 times when it actually was one instance read multiple times). Every row has an ID so I used the pandas df.groupby() function on the ID. Here's a table to try and illustrate what I want to do:

# ID   |  functionality   |   v_region_score   |   constant_region 
# -----------------------------------------------------------------
# 123  |  productive      |      820           |      NaN
#      |  unknown         |      720           |      NaN
#      |  unknown         |      720           |      IgM
# 456  |  unknown         |      690           |      NaN
#      |  unknown         |      670           |      NaN
# 789  |  productive      |      780           |      IgM
#      |  productive      |      780           |      NaN

(Edit) Here is the code for an example dataframe:

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]], 
    columns=["ID", "functionality", "v_region_score", "constant_region"])

And this would be the final output with correct rows chosen:

df2 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan]], 
    columns=["ID", "functionality", "v_region_score", "constant_region"])

So after grouping, for each group if it has a "productive" value in functionality I want to keep that row, if it is "unknown" I take the highest v_region_score, and if there are multiple "productive" values, I take the one that has some value in its constant_region.

I've tried several ways of accessing these values:

id, frame = next(iter(df_grouped))

if frame["functionality"].equals("productive"):
    # do something

Looking at just one group:

x = df_grouped.get_group("1:1101:10897:22442")

for index, value in x["functionality"].items():
    print(value)

# returns the correct value and type "str"

Even putting each group into a list:

new_groups = []

for id, frame in df_grouped:
    new_groups.append(frame)

# access a specific index returns a dataframe
new_groups[30]

The error I get with all of these is "The truth value of a Series is ambiguous" which I now understand why this doesn't work, but I can't use a.any(), a.all(), or a.bool() because of how complicated the conditional is.

Is there any way I can choose a specific row in each group based on the values of its columns? Sorry for such a complicated question and thanks in advance! :)


Solution

  • You can approach your problem from a different angle:

    1. Sort values based on your criteria
    2. Group by ID
    3. Keep first result per grouped ID

    For example:

    df1 = df1.sort_values(['ID','functionality','v_region_score','constant_region'], ascending=[True,True,False,True], na_position='last')
    
    df1.groupby('ID').first().reset_index()
    
    Out[0]:
        ID functionality  v_region_score constant_region
    0  123    productive             820             IgM
    1  456       unknown             690             NaN
    2  789    productive             780             IgM
    
    

    Additionally, if you want to merge values from constant_region when it's null, you can use fillna(method='ffill') so you keep the value that exists:

    ## sorted here
    
    df1['constant_region'] = df1.groupby('ID')['constant_region'].fillna(method='ffill')
    
    df1
    Out[1]: 
        ID functionality  v_region_score constant_region
    4  123    productive             820             NaN
    2  123       unknown             720             IgM
    1  123       unknown             720             IgM
    5  456       unknown             690             NaN
    6  456       unknown             670             NaN
    0  789    productive             780             IgM
    3  789    productive             780             IgM
    
    ## Group by here