I have hourly data for each day of the year for 11 years. I want to find the maximum ozone value each day (I have 6 ozone columns), but I want to keep the entire row of data for the hour mark that the maximum value occurred, incluiding the date. I have a lot of meteorological parameters stored in this dataframe (like temperature, wind speed, etc) and don't want the code to drop any of the other columns. I want the code to show me every detail of the line, including the datetime
column, that the maximum ozone value occurs. The important component of this code is that I need the maximum ozone value for any given day, while preserving all other rows of data, including the datetime column, meaning I want to see the year-month-day-hour that the maximum ozone value occurred.
Here's the link to my csv file: https://drive.google.com/file/d/1iyrvbD9gPHoTmwhSxo8aPfgfAIbpOBK6/view?usp=sharing
Here's the code I have tried thus far:
import pandas as pd
df = pd.read_csv('')
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index)
#these are the columns I want to perform the action on:
ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168']
mda8 = df.loc[df.groupby([df.index.year, df.index.month, df.index.day], as_index=False)[ozone].idxmax()]
I have also tried:
df['day'] = df.index.day
df['year'] = df.index.year
df['month'] = df.index.month
df = df.reset_index()
mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)[ozone].idxmax()]
I also tried an individual column:
mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)["MDA8_3135"].max()]
I keep getting an error message that says ValueError: Cannot index with multidimensional key
no matter what method I use. I assume it's because of the groupby, but someone has done it before and got it to work:
Get the row(s) which have the max value in groups using groupby
EDIT
DATA SAMPLE:
df = pd.DataFrame({'date':['2009-01-01 00:00:00',
'2009-01-01 01:00:00',
'2009-01-01 02:00:00',
'2009-01-01 03:00:00',
'2009-01-02 04:00:00',
'2009-01-02 05:00:00',
'2009-01-02 06:00:00',
'2009-01-02 07:00:00',
'2009-01-03 08:00:00',
'2009-01-03 09:00:00'],
'nox_3135(ppb)':[20,29,27,31,33,14,34,23,32,31],
'CO_3135(ppm)':[0.8,0.9,0.1,0.2,0.5,0.5,0.7,0.9,0.9,0.3],
'MDA8_3135':[42,45,47,51,52,52,57,67,65,70],
'pm25_3135_2018':[6,7,6,7,4,5,2,11,9,18]})
Okay, so notice that there are multiple times for the same day. I want to pull the daily maximum ozone value from MDA8_3135
column, and I want to include all other rows from that specific maximum value.
The new dataframe would look like:
new = pd.DataFrame({'date':['2009-01-01 03:00:00','2009-01-02 07:00:00',
'2009-01-03 09:00:00'],
'nox_3135(ppb)':[31,23,31],
'CO_3135(ppm)':[0.2,0.9,0.3],
'MDA8_3135':[51,67,70],
'pm25_3135_2018':[7,11,18]})
In my actual dataframe, I have 6 MDA8 columns, so I want to perform the code on all of these columns INDIVIDUALLY and create 6 new dataframes with the maximum ozone concentration and all the other columns that correspond to the specific timestamp that the maximum ozone value occurs on a given day.
Based on your clarification that you want 6 separate dataframes, compute them in a comprehension.
For a given ozone column, use groupby.idxmax
to find the date index of the max ozone value. Since some dates are NaT
, dropna
before indexing the matching rows with loc
:
ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168']
df = pd.read_csv('IV_hourly_dataframe_Clim_Anom.csv', parse_dates=['date'], index_col=['date'])
out = {col: df.loc[df[col].groupby(df.index.date).idxmax().dropna()] for col in ozone}
Output using IV_hourly_dataframe_Clim_Anom.csv
on pandas 1.3.3 and python 3.9.7:
>>> out['MDA8_3135']
# T_3186 T_5408 ... MDA8_3135 ... CO_3135_da8_anom CO_2199_da8_anom
# date
# 2009-01-01 11:00:00 16.0 NaN ... 35.0000 ... 1.096090 0.004785
# 2009-01-02 12:00:00 18.0 NaN ... 32.1250 ... 0.258590 0.117285
# ... ... ... ... ... ... ... ...
# 2019-12-31 12:00:00 20.3 18.35 ... 37.2000 ... -0.616592 -0.079884
>>> out['MDA8_2551']
# T_3186 T_5408 ... MDA8_2551 ... CO_3135_da8_anom CO_2199_da8_anom
# date
# 2009-01-01 12:00:00 17.0 NaN ... 34.2500 ... 0.383590 0.042285
# 2009-01-02 12:00:00 18.0 NaN ... 28.7500 ... 0.258590 0.117285
# ... ... ... ... ... ... ... ...
# 2019-12-31 12:00:00 20.3 18.35 ... 32.7875 ... -0.616592 -0.079884
...
>>> out['MDA8_3168']
# T_3186 T_5408 ... MDA8_3168 ... CO_3135_da8_anom CO_2199_da8_anom
# date
# 2009-01-01 12:00:00 17.0 NaN ... 39.250000 ... 0.383590 0.042285
# 2009-01-02 14:00:00 19.0 NaN ... 58.000000 ... 0.508590 0.117285
# ... ... ... ... ... ... ... ...
# 2019-12-31 11:00:00 19.6 16.83 ... 44.187500 ... -0.620967 -0.084234