
Extracting the values belonging to each day and computing a daily average


I have a dataset of timestamped values covering January 12th to August 3rd, 2018.

The shape of the df_luminosidad DataFrame is:

df_luminosidad.shape
(9752, 2)

The rows are spaced half an hour apart. The first row is on 2018-01-12:

df_luminosidad.iloc[0]

Fecha:       2018-01-12 00:17:28
Luz (lux)                      1
Name: 0, dtype: object

And the last row is on 2018-08-03:

df_luminosidad.tail(1)

Fecha:  Luz (lux)
9751    2018-08-03 23:44:59 1

Many rows correspond to the same day, as shown above.

I want to create a new DataFrame in which each date appears only once in the Fecha: column (no repeated dates), with the Luz (lux) column holding the average of all the values recorded on that date.

It would look like this:

|  Fecha:     |  Luz(lux) - Average each day values |
|  2018-01-12 |  9183.479167                        |
|  2018-01-13 |  7431.8125                          |
|  2018-01-14 |  11073.1875                         |
|  2018-01-15 |  11456.25                           |
|      .             .                              |
|      .             .                              |
|      .             .                              |
|  2018-08-03 |  11331.229167                       |

I am using the pd.date_range() function to build a DataFrame at a specified frequency from the values of the previous df_luminosidad DataFrame, as @piRSquared suggests in this answer.

From January 12 to August 3 there are 203 days (29 weeks), so I chose 204 for the periods argument, and I use the calendar-day frequency D for the freq argument:

import pandas as pd

df = pd.DataFrame(dict(
    Date=pd.date_range('2018-01-12', periods=204, freq='D'),
    # .mean() returns a single scalar, which is repeated on every row
    Value=df_luminosidad['Luz (lux)'].mean()
))

Up to this point the approach works, but I am unsure how to compute the mean of the Luz (lux) values for each day generated by pd.date_range. At the moment I get the same average value for every day from 2018-01-12 to 2018-08-03, as follows:

|  Date       |  Value        |
|  2018-01-12 |  11228.888331 |                       
|  2018-01-13 |  11228.888331 |                         
|  2018-01-14 |  11228.888331 |                        
|  2018-01-15 |  11228.888331 |                          
|      .             .        |                      
|      .             .        |                      
|      .             .        |                      
|  2018-08-03 |  11331.229167 | 

I have already generated a separate DataFrame for each value of the Fecha: column, and I can compute each average separately, but this forces me to read each day's file individually.

How can I automatically group all the values of each day into a single daily average and collect the results in one DataFrame?


Solution

  • I believe you need resample with mean, or a mean aggregation with Grouper:

    df_luminosidad['Fecha:'] = pd.to_datetime(df_luminosidad['Fecha:'])
    
    df = df_luminosidad.resample('D', on='Fecha:')['Luz (lux)'].mean().reset_index()
    

    Or:

    df = (df_luminosidad.groupby(pd.Grouper(key='Fecha:', freq='D'))['Luz (lux)']
                        .mean()
                        .reset_index())
    

    Other solutions, using a DatetimeIndex:

    df_luminosidad['Fecha:'] = pd.to_datetime(df_luminosidad['Fecha:'])
    df_luminosidad = df_luminosidad.set_index('Fecha:')
    
    
    # Either line gives the same daily means; use one of them
    df = df_luminosidad.resample('D')['Luz (lux)'].mean().reset_index()
    df = df_luminosidad.groupby(pd.Grouper(freq='D'))['Luz (lux)'].mean().reset_index()
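
For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the resample approach. It assumes synthetic half-hourly readings in place of the real df_luminosidad: the three-day range, the made-up values and the names df_gap, daily, daily_resample and daily_groupby are illustrative only. It also shows one behaviour worth knowing: resample('D') keeps a calendar day with no readings as a NaN row, while grouping by the normalized date (a different technique, shown for comparison) leaves that day out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_luminosidad: one reading every 30 minutes
# over three days. The values are made up (0, 1, 2, ...).
fechas = pd.date_range("2018-01-12 00:17:28", periods=3 * 48, freq="30min")
df_luminosidad = pd.DataFrame({
    "Fecha:": fechas,
    "Luz (lux)": np.arange(len(fechas), dtype=float),
})

# One row per calendar day, holding the mean of that day's readings.
daily = df_luminosidad.resample("D", on="Fecha:")["Luz (lux)"].mean().reset_index()
print(daily)  # daily means: 23.5, 71.5, 119.5

# Remove every reading from 2018-01-13 to simulate a day without data.
df_gap = df_luminosidad[df_luminosidad["Fecha:"].dt.day != 13]

# resample keeps the empty day as a row whose mean is NaN ...
daily_resample = df_gap.resample("D", on="Fecha:")["Luz (lux)"].mean().reset_index()

# ... while grouping by the date part only returns days that have data.
daily_groupby = (
    df_gap.groupby(df_gap["Fecha:"].dt.normalize())["Luz (lux)"]
          .mean()
          .reset_index()
)

print(len(daily_resample), len(daily_groupby))  # 3 2
```

If missing days should not appear in the result, calling .dropna() on the resampled output is another option.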