Search code examples
python-3.xpandasdataframepandas-groupbytimeserieschart

Pandas grouping and resampling for a bar plot:


I have a dataframe that records concentrations for several different locations in different years, with a high temporal frequency (<1 hour). I am trying to make a bar/multibar plot showing mean concentrations, at different locations in different years

To calculate mean concentration, I have to apply quality control filters to daily and monthly data.

My approach is to first apply filters and resample per year and then do the grouping by location and year.

Also, out of all the locations (in the column titled locations) I have to choose only a few rows. So, I am slicing the original dataframe and creating a new dataframe with selected rows.

I am not able to achieve this using the following code:

date=df['date']
location = df['location']
df.date = pd.to_datetime(df.date)
year=df.date.dt.year
df=df.set_index(date)


df['Year'] = df['date'].map(lambda x: x.year )

#Location name selection/correction in each city:
#Changing all stations:
df['location'] = df['location'].map(lambda x: "M" if x == "mm" else x)

#New dataframe:
df_new = df[(df['location'].isin(['K', 'L', 'M']))]


#Data filtering:
df_new = df_new[df_new['value'] >= 0]

df_new.drop(df_new[df_new['value'] > 400].index, inplace = True)

df_new.drop(df_new[df_new['value'] <2].index, inplace = True)

diurnal = df_new[df_new['value']].resample('12h')

diurnal_mean = diurnal.mean()[diurnal.count() >= 9]

daily_mean=diurnal_mean.resample('d').mean()

df_month=daily_mean.resample('m').mean()

df_yearly=df_month[df_month['value']].resample('y')

#For plotting:

df_grouped = df_new.groupby(['location', 'Year']).agg({'value':'sum'}).reset_index()

sns.barplot(x='location',y='value',hue='Year',data= df_grouped)

This is one of the many errors that cropped up:

"None of [Float64Index([22.73, 64.81,  8.67, 19.98, 33.12, 37.81, 39.87, 42.29, 37.81,\n              36.51,\n              ...\n               11.0,  40.0,  23.0,  80.0,  50.0,  60.0,  40.0,  80.0,  80.0,\n               17.0],\n             dtype='float64', length=63846)] are in the [columns]"
ERROR:root:Invalid alias: The name clear can't be aliased because it is another magic command.

This is a sample dataframe, showing what I need to plot; value column should ideally represent resampled values, after performing the quality control operations and resampling.

Unnamed: 0 location  value  \
date                                    location          value                                                                         
2017-10-21 08:45:00+05:30        8335    M                339.3   
2017-08-18 17:45:00+05:30        8344    M                 45.1   
2017-11-08 13:15:00+05:30        8347    L                594.4   
2017-10-21 13:15:00+05:30        8659    N                189.9   
2017-08-18 15:45:00+05:30        8662    N                 46.5   

This is how the a part of the actual data should look like, after selecting the chosen locations. I am a new user so cannot attach a screenshot of the graph I require. This query is an extension of the query I had posted earlier , with the additional requirement of plotting resampled data instead of simple value counts. Iteration over years to plot different group values as bar plot in pandas

Any help will be much appreciated.


Solution

  • Fundamentally, your errors come with this unclear indexing where you are passing continuous, float values of one column for rowwise selection of index which currently is a datetime type.

    df_new[df_new['value']]           # INDEXING DATETIME USING FLOAT VALUES
    ...
    df_month[df_month['value']]       # COLUMN value DOES NOT EXIST
    

    Possibly, you meant to select the column value (out of the others) during resampling.

    diurnal = df_new['value'].resample('12h')
    
    diurnal.mean()[diurnal.count() >= 9]
    
    daily_mean = diurnal_mean.resample('d').mean()    
    df_month = daily_mean.resample('m').mean()       # REMOVE value BEING UNDERLYING SERIES
    df_yearly = df_month.resample('y')                
    

    However, no where above do you retain location for plotting. Hence, instead of resample, use groupby(pd.Grouper(...))

    # AGGREGATE TO KEEP LOCATION AND 12h
    diurnal = (df_new.groupby(["location", pd.Grouper(freq='12h')])["value"]
                     .agg(["count", "mean"])
                     .reset_index().set_index(['date'])
               )
    # FILTER
    diurnal_sub = diurnal[diurnal["count"] >= 9]
    
    # MULTIPLE DATE TIME LEVEL MEANS
    daily_mean = diurnal_sub.groupby(["location", pd.Grouper(freq='d')])["mean"].mean()
    df_month = diurnal_sub.groupby(["location", pd.Grouper(freq='m')])["mean"].mean()
    df_yearly = diurnal_sub.groupby(["location", pd.Grouper(freq='y')])["mean"].mean()
    
    print(df_yearly)
    

    To demonstrate with random, reproducible data:

    Data

    import numpy as np
    import pandas as pd
    
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    np.random.seed(242020)
    random_df = pd.DataFrame({'date': (np.random.choice(pd.date_range('2017-01-01', '2019-12-31'), 5000) + 
                                       pd.to_timedelta(np.random.randint(60*60, 60*60*24, 5000), unit='s')),
                              'location': np.random.choice(list("KLM"), 5000),
                              'value': np.random.uniform(10, 1000, 5000)                          
                             })
    

    Aggregation

    loc_list = list("KLM")
    
    # NEW DATA FRAME WITH DATA FILTERING
    df = (random_df.set_index(random_df['date'])
                   .assign(Year = lambda x: x['date'].dt.year,
                           location = lambda x: x['location'].where(x["location"] != "mm", "M"))
                   .query('(location == @loc_list) and (value >= 2 and value <= 400)')
          )
    
    # 12h AGGREGATION
    diurnal = (df_new.groupby(["location", pd.Grouper(freq='12h')])["value"]
                     .agg(["count", "mean"])
                     .reset_index().set_index(['date'])
                     .query("count >= 2")
              )
    
    
    # d, m, y AGGREGATION
    daily_mean = diurnal.groupby(["location", pd.Grouper(freq='d')])["mean"].mean()
    df_month = diurnal.groupby(["location", pd.Grouper(freq='m')])["mean"].mean()
    df_yearly = (diurnal.groupby(["location", pd.Grouper(freq='y')])["mean"].mean()
                        .reset_index()
                        .assign(Year = lambda x: x["date"].dt.year)
                )
    
    print(df_yearly)
    #   location       date        mean  Year
    # 0        K 2017-12-31  188.984592  2017
    # 1        K 2018-12-31  199.521702  2018
    # 2        K 2019-12-31  216.497268  2019
    # 3        L 2017-12-31  214.347873  2017
    # 4        L 2018-12-31  199.232711  2018
    # 5        L 2019-12-31  177.689221  2019
    # 6        M 2017-12-31  222.412711  2017
    # 7        M 2018-12-31  241.597977  2018
    # 8        M 2019-12-31  215.554228  2019
    

    Plotting

    sns.set()
    fig, axs = plt.subplots(figsize=(12,5))
    sns.barplot(x='location', y='mean', hue='Year', data= df_yearly, ax=axs)
    
    plt.title("Location Value Yearly Aggregation", weight="bold", size=16)
    plt.show()
    plt.clf()
    plt.close()
    

    Plot Output