I have a large df: link. The data is grouped by col, row, year, No, potveg and total. I am trying to get the row with the max value of the 'total' column within each group for a specific year, but ONLY if that row's 'Possible' value is TRUE. If the row with the max 'total' has 'Possible' = FALSE, then take the second-highest value, and so on.
If all the values of the 'Possible' column in a specific year group are FALSE, then I want to pick the max out of those rows so that I don't skip any years.
i.e., for the dataset below:
col row year No potveg total Possible
-125 42.5 2015 1 9 697.3 FALSE
2015 2 13 535.2 TRUE
2015 3 15 82.3 TRUE
2016 1 9 907.8 TRUE
2016 2 13 137.6 FALSE
2016 3 15 268.4 TRUE
2017 1 9 961.9 FALSE
2017 2 13 74.2 TRUE
2017 3 15 248 TRUE
2018 1 9 937.9 TRUE
2018 2 13 575.6 TRUE
2018 3 15 215.5 FALSE
-135 70.5 2015 1 8 697.3 FALSE
2015 2 10 535.2 TRUE
2015 3 19 82.3 TRUE
2016 1 8 907.8 TRUE
2016 2 10 137.6 FALSE
2016 3 19 268.4 TRUE
2017 1 8 961.9 FALSE
2017 2 10 74.2 TRUE
2017 3 19 248 TRUE
2018 1 8 937.9 TRUE
2018 2 10 575.6 TRUE
2018 3 19 215.5 FALSE
The output would be:
col row year No potveg total Possible
-125 42.5 2015 2 13 535.2 TRUE
2016 1 9 907.8 TRUE
2017 3 15 248 TRUE
2018 1 9 937.9 TRUE
-135 70.5 2015 2 10 535.2 TRUE
2016 1 8 907.8 TRUE
2017 3 19 248 TRUE
2018 1 8 937.9 TRUE
I tried:
df1 = pd.merge(df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
               df,
               on=['col', 'row', 'year', 'total'])
But that only selects the overall max value per group, regardless of the 'Possible' flag.
Thanks!
Using your newly provided dataset, this is the solution I came up with. I'm not 100% sure it's what you're after (on re-reading the question I suspect it isn't exactly), but I'll put it here anyway.
import pandas as pd

df = pd.read_csv('data.csv')

# Separate out the true and false possibilities.
true_df = df[df['Possible'] == True]
false_df = df[df['Possible'] == False]

# Reduce the true and the false subsets to the rows holding the max 'total' value per group.
truth_aggregated_df = pd.merge(true_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
                               true_df,
                               on=['col', 'row', 'year', 'total'])
false_aggregated_df = pd.merge(false_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
                               false_df,
                               on=['col', 'row', 'year', 'total'])

# Iterate through false_aggregated_df to build a list of indexes where the following conditions are met:
#   - the row's 'col', 'row', 'No' and 'potveg' values appear in truth_aggregated_df;
#   - the row's 'year' does NOT appear in truth_aggregated_df for that combination.
# This way, we collect the indexes where a year is not represented for a specific
# 'col', 'row', 'No' and 'potveg' combination.
false_indexes_to_keep = list()
for false_index, false_row in false_aggregated_df.iterrows():
    false_col_in_true_df = false_row['col'] in truth_aggregated_df['col'].values
    false_row_in_true_df = false_row['row'] in truth_aggregated_df['row'].values
    false_no_in_true_df = false_row['No'] in truth_aggregated_df['No'].values
    false_potveg_in_true_df = false_row['potveg'] in truth_aggregated_df['potveg'].values
    if not (false_col_in_true_df and false_row_in_true_df
            and false_no_in_true_df and false_potveg_in_true_df):
        continue
    # We now know this 'col', 'row', 'No' and 'potveg' combination is represented in
    # truth_aggregated_df, so check whether truth_aggregated_df is missing the year of
    # the `false_row` being iterated on. If so, make note of the corresponding false index.
    truth_aggregated_df_subset = truth_aggregated_df[(truth_aggregated_df['col'] == false_row['col']) &
                                                     (truth_aggregated_df['row'] == false_row['row']) &
                                                     (truth_aggregated_df['No'] == false_row['No']) &
                                                     (truth_aggregated_df['potveg'] == false_row['potveg'])]
    if false_row['year'] not in truth_aggregated_df_subset['year'].values:
        false_indexes_to_keep.append(false_index)

# Create the subset of false_aggregated_df using the list we just made.
false_subset_df = false_aggregated_df[false_aggregated_df.index.isin(false_indexes_to_keep)]
concat_df = pd.concat([truth_aggregated_df, false_subset_df])
concat_df = concat_df.sort_values(by=['col', 'row', 'year', 'potveg', 'No'])
print(concat_df)
The output dataframe contains the max 'total' value for each year for each 'col', 'row', 'No' and 'potveg' combination, and fills in the years that only appear in the FALSE subset so no year is skipped.
There's almost certainly a better way, but this is the simplest approach I could come up with. Generally, iterating through a dataframe is a bad idea.
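For reference, here is a non-iterative sketch of the same "prefer TRUE rows, fall back to the FALSE max" rule: sort so that, within each group, TRUE rows with the highest 'total' come first, then keep the first row per group. This uses a small inline version of the sample data rather than your CSV, so treat it as a sketch:

```python
import pandas as pd

# Small inline version of the sample data (first two years of one group).
df = pd.DataFrame({
    'col':      [-125] * 6,
    'row':      [42.5] * 6,
    'year':     [2015, 2015, 2015, 2016, 2016, 2016],
    'No':       [1, 2, 3, 1, 2, 3],
    'potveg':   [9, 13, 15, 9, 13, 15],
    'total':    [697.3, 535.2, 82.3, 907.8, 137.6, 268.4],
    'Possible': [False, True, True, True, False, True],
})

# Sort so that, within each (col, row, year) group, TRUE rows come before
# FALSE rows, and higher 'total' comes first within each flag. The first
# row per group is then the TRUE max if one exists, else the FALSE max.
result = (df.sort_values(['Possible', 'total'], ascending=[False, False])
            .groupby(['col', 'row', 'year'])
            .head(1)
            .sort_values(['col', 'row', 'year']))
print(result)
```

Because every group still has a first row after sorting, a year whose rows are all FALSE simply yields its FALSE max, which matches the "don't skip any years" requirement.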