I have a large df: link. The data is grouped by col, row, year, No, potveg and total. I am trying to get the row with the max value of the 'total' column within each group for a specific year, but ONLY if that row's 'Possible' value is TRUE. If the row with the max 'total' has 'Possible' = FALSE, then take the second-highest value, and so on.
If all the values of the 'Possible' column in a specific year group are FALSE, then I want to pick the max out of those rows so that I don't skip any years.
i.e., for the dataset below:
col row year No potveg total Possible
-125 42.5 2015 1 9 697.3 FALSE
2015 2 13 535.2 TRUE
2015 3 15 82.3 TRUE
2016 1 9 907.8 TRUE
2016 2 13 137.6 FALSE
2016 3 15 268.4 TRUE
2017 1 9 961.9 FALSE
2017 2 13 74.2 TRUE
2017 3 15 248 TRUE
2018 1 9 937.9 TRUE
2018 2 13 575.6 TRUE
2018 3 15 215.5 FALSE
-135 70.5 2015 1 8 697.3 FALSE
2015 2 10 535.2 TRUE
2015 3 19 82.3 TRUE
2016 1 8 907.8 TRUE
2016 2 10 137.6 FALSE
2016 3 19 268.4 TRUE
2017 1 8 961.9 FALSE
2017 2 10 74.2 TRUE
2017 3 19 248 TRUE
2018 1 8 937.9 TRUE
2018 2 10 575.6 TRUE
2018 3 19 215.5 FALSE
The output would be:
col row year No potveg total Possible
-125 42.5 2015 2 13 535.2 TRUE
2016 1 9 907.8 TRUE
2017 3 15 248 TRUE
2018 1 9 937.9 TRUE
-135 70.5 2015 2 10 535.2 TRUE
2016 1 8 907.8 TRUE
2017 3 19 248 TRUE
2018 1 8 937.9 TRUE
I tried:
df1 = pd.merge(df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
               df,
               on=['col', 'row', 'year', 'total'])
But that only selects the overall max value per group, regardless of the 'Possible' flag.
Thanks!
Using your newly provided dataset, this is the solution I came up with. I'm not 100% sure it's what you're after (on re-reading the question I suspect it isn't exactly), but I'll put it here anyway.
import pandas as pd

df = pd.read_csv('data.csv')

# Separate out the true and false possibilities.
true_df = df[df['Possible'] == True]
false_df = df[df['Possible'] == False]

# Reduce the true and the false subsets to the rows holding the max 'total' value per group.
truth_aggregated_df = pd.merge(true_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
                               true_df,
                               on=['col', 'row', 'year', 'total'])
false_aggregated_df = pd.merge(false_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
                               false_df,
                               on=['col', 'row', 'year', 'total'])

# Iterate through false_aggregated_df to build a list of indexes where the following conditions are met:
#   - the row's 'col', 'row', 'No' and 'potveg' values appear in truth_aggregated_df;
#   - the row's 'year' does NOT appear in truth_aggregated_df for that combination.
# This way, we collect the indexes where a year is not represented for a specific
# 'col', 'row', 'No' and 'potveg' combination.
false_indexes_to_keep = list()
for false_index, false_row in false_aggregated_df.iterrows():
    false_col_in_true_df = false_row['col'] in truth_aggregated_df['col'].values
    false_row_in_true_df = false_row['row'] in truth_aggregated_df['row'].values
    false_no_in_true_df = false_row['No'] in truth_aggregated_df['No'].values
    false_potveg_in_true_df = false_row['potveg'] in truth_aggregated_df['potveg'].values
    if not (false_col_in_true_df and false_row_in_true_df
            and false_no_in_true_df and false_potveg_in_true_df):
        continue
    # We now know this 'col', 'row', 'No' and 'potveg' combination is represented in
    # truth_aggregated_df, so check whether truth_aggregated_df is missing the year of
    # the `false_row` being iterated on. If so, make note of the corresponding false index.
    truth_aggregated_df_subset = truth_aggregated_df[(truth_aggregated_df['col'] == false_row['col']) &
                                                     (truth_aggregated_df['row'] == false_row['row']) &
                                                     (truth_aggregated_df['No'] == false_row['No']) &
                                                     (truth_aggregated_df['potveg'] == false_row['potveg'])]
    if false_row['year'] not in truth_aggregated_df_subset['year'].values:
        false_indexes_to_keep.append(false_index)

# Create the subset of false_aggregated_df using the list we just made.
false_subset_df = false_aggregated_df[false_aggregated_df.index.isin(false_indexes_to_keep)]
concat_df = pd.concat([truth_aggregated_df, false_subset_df])
concat_df = concat_df.sort_values(by=['col', 'row', 'year', 'potveg', 'No'])
print(concat_df)
The output dataframe contains the max 'total' value for each year for each 'col', 'row', 'No' and 'potveg' combination, and fills in the years that only appear in the FALSE subset so no year is skipped.
There's almost certainly a better way, but this is the simplest approach I could come up with. Generally, iterating through a dataframe is a bad idea.
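For reference, here is a non-iterative sketch of the same "prefer TRUE rows, fall back to the FALSE max" rule: sort so that, within each group, TRUE rows with the highest 'total' come first, then keep the first row per group. This uses a small inline version of the sample data rather than your CSV, so treat it as a sketch:

```python
import pandas as pd

# Small inline version of the sample data (first two years of one group).
df = pd.DataFrame({
    'col':      [-125] * 6,
    'row':      [42.5] * 6,
    'year':     [2015, 2015, 2015, 2016, 2016, 2016],
    'No':       [1, 2, 3, 1, 2, 3],
    'potveg':   [9, 13, 15, 9, 13, 15],
    'total':    [697.3, 535.2, 82.3, 907.8, 137.6, 268.4],
    'Possible': [False, True, True, True, False, True],
})

# Sort so that, within each (col, row, year) group, TRUE rows come before
# FALSE rows, and higher 'total' comes first within each flag. The first
# row per group is then the TRUE max if one exists, else the FALSE max.
result = (df.sort_values(['Possible', 'total'], ascending=[False, False])
            .groupby(['col', 'row', 'year'])
            .head(1)
            .sort_values(['col', 'row', 'year']))
print(result)
```

Because every group still has a first row after sorting, a year whose rows are all FALSE simply yields its FALSE max, which matches the "don't skip any years" requirement.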