Search code examples
pythonpandasmaxpandas-groupbypandas-loc

Pandas multidimensional key error after combining loc and groupby


I have hourly data for each day of the year for 11 years. I want to find the maximum ozone value each day (I have 6 ozone columns), but I want to keep the entire row of data for the hour mark that the maximum value occurred, incluiding the date. I have a lot of meteorological parameters stored in this dataframe (like temperature, wind speed, etc) and don't want the code to drop any of the other columns. I want the code to show me every detail of the line, including the datetime column, that the maximum ozone value occurs. The important component of this code is that I need the maximum ozone value for any given day, while preserving all other rows of data, including the datetime column, meaning I want to see the year-month-day-hour that the maximum ozone value occurred.

Here's the link to my csv file: https://drive.google.com/file/d/1iyrvbD9gPHoTmwhSxo8aPfgfAIbpOBK6/view?usp=sharing

Here's the code I have tried thus far:

import pandas as pd

df = pd.read_csv('')
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index)

#these are the columns I want to perform the action on:
ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168'] 

mda8 = df.loc[df.groupby([df.index.year, df.index.month, df.index.day], as_index=False)[ozone].idxmax()]

I have also tried:

df['day'] = df.index.day
df['year'] = df.index.year
df['month'] = df.index.month
df = df.reset_index()
mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)[ozone].idxmax()]

I also tried an individual column:

mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)["MDA8_3135"].max()]

I keep getting an error message that says ValueError: Cannot index with multidimensional key no matter what method I use. I assume it's because of the groupby, but someone has done it before and got it to work: Get the row(s) which have the max value in groups using groupby

EDIT

DATA SAMPLE:

df = pd.DataFrame({'date':['2009-01-01 00:00:00', 
'2009-01-01 01:00:00', 
'2009-01-01 02:00:00', 
'2009-01-01 03:00:00', 
'2009-01-02 04:00:00', 
'2009-01-02 05:00:00', 
'2009-01-02 06:00:00', 
'2009-01-02 07:00:00', 
'2009-01-03 08:00:00', 
'2009-01-03 09:00:00'], 
'nox_3135(ppb)':[20,29,27,31,33,14,34,23,32,31],
'CO_3135(ppm)':[0.8,0.9,0.1,0.2,0.5,0.5,0.7,0.9,0.9,0.3],
'MDA8_3135':[42,45,47,51,52,52,57,67,65,70],
'pm25_3135_2018':[6,7,6,7,4,5,2,11,9,18]})

Okay, so notice that there are multiple times for the same day. I want to pull the daily maximum ozone value from MDA8_3135 column, and I want to include all other rows from that specific maximum value. The new dataframe would look like:

new = pd.DataFrame({'date':['2009-01-01 03:00:00','2009-01-02 07:00:00', 
'2009-01-03 09:00:00'], 
'nox_3135(ppb)':[31,23,31],
'CO_3135(ppm)':[0.2,0.9,0.3],
'MDA8_3135':[51,67,70],
'pm25_3135_2018':[7,11,18]})

In my actual dataframe, I have 6 MDA8 columns, so I want to perform the code on all of these columns INDIVIDUALLY and create 6 new dataframes with the maximum ozone concentration and all the other columns that correspond to the specific timestamp that the maximum ozone value occurs on a given day.


Solution

  • Based on your clarification that you want 6 separate dataframes, compute them in a comprehension.

    For a given ozone column, use groupby.idxmax to find the date index of the max ozone value. Since some dates are NaT, dropna before indexing the matching rows with loc:

    ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168'] 
    df = pd.read_csv('IV_hourly_dataframe_Clim_Anom.csv', parse_dates=['date'], index_col=['date'])
    out = {col: df.loc[df[col].groupby(df.index.date).idxmax().dropna()] for col in ozone}
    

    Output using IV_hourly_dataframe_Clim_Anom.csv on pandas 1.3.3 and python 3.9.7:

    >>> out['MDA8_3135']
    #                      T_3186  T_5408  ...  MDA8_3135  ...  CO_3135_da8_anom  CO_2199_da8_anom
    # date
    # 2009-01-01 11:00:00    16.0     NaN  ...    35.0000  ...          1.096090          0.004785
    # 2009-01-02 12:00:00    18.0     NaN  ...    32.1250  ...          0.258590          0.117285
    # ...                     ...     ...  ...        ...  ...               ...               ...
    # 2019-12-31 12:00:00    20.3   18.35  ...    37.2000  ...         -0.616592         -0.079884
    
    >>> out['MDA8_2551']
    #                      T_3186  T_5408  ...  MDA8_2551  ...  CO_3135_da8_anom  CO_2199_da8_anom
    # date
    # 2009-01-01 12:00:00    17.0     NaN  ...    34.2500  ...          0.383590          0.042285
    # 2009-01-02 12:00:00    18.0     NaN  ...    28.7500  ...          0.258590          0.117285
    # ...                     ...     ...  ...        ...  ...               ...               ...
    # 2019-12-31 12:00:00    20.3   18.35  ...    32.7875  ...         -0.616592         -0.079884
    

    ...

    >>> out['MDA8_3168']
    #                      T_3186  T_5408  ...  MDA8_3168  ...  CO_3135_da8_anom  CO_2199_da8_anom
    # date
    # 2009-01-01 12:00:00    17.0     NaN  ...  39.250000  ...          0.383590          0.042285
    # 2009-01-02 14:00:00    19.0     NaN  ...  58.000000  ...          0.508590          0.117285
    # ...                     ...     ...  ...        ...  ...               ...               ...
    # 2019-12-31 11:00:00    19.6   16.83  ...  44.187500  ...         -0.620967         -0.084234