Search code examples
pandasmultiple-columnsdata-cleaning

How to clean csv data over multiple columns?


I'm trying to help a non profit with some data analysis. Their spreadsheets look like this:

enter image description here

The 'S1' stand for sensor and there are two readings (temp & Humidity), which are taken three times a day (6 12 6). How do I make columns to get data for 'S1', like daily avg, weekly, avg, monthly avg for both temp and humidity? I might be in over my head on this one. I can use pandas for basic stuff but I can't figure out how to combine columns under one heading of 'S1'? Any help is greatly appreciated.

enter image description here

This is as far as i've got. Not sure I'm on the right path or not?


Solution

  • Your dataframe is not so bad after all. You have only one exception on lines (2, 3) vs 7, 11, 15, etc. Another minor problem is for hours: I assume hours are 6AM, 12PM and 6PM, so they need to be converted as 6, 12 and 18 in 24-Hour time format. If my assumptions are right and your screenshot contains no more exception, you can clean your data like this:

    # Read data file without header/index and don't let pandas infers the dtype
    df = pd.read_csv('TestData.csv', header=None, dtype=str)
    m = df[1].notna() & df[0].isna()
    
    # Create datetime index (rows)
    dt = (pd.to_datetime(df[0].ffill() + ' ' + df[1], format='%d-%b-%y %H')
            .rename('DateTime')[m])
    dt[2::3] += pd.DateOffset(hours=12)  # convert 6PM to 18
    
    # Create multi-index headers (columns)
    cols = df.loc[[0, 1]].ffill(axis=1).dropna(how='all', axis=1).T
    mi = pd.MultiIndex.from_frame(cols, names=['Sensor', 'Metric'])
    
    # Final output (remove two first columns)
    out = df.loc[m, 2:].set_axis(dt, axis=0).set_axis(mi, axis=1).astype(float)
    

    Output:

    >>> out
    Sensor                 S1          S2          S3      
    Metric                  T     %     T     %     T     %
    DateTime                                               
    2023-08-01 06:00:00  30.3  87.0  29.4  86.0  29.5  65.0
    2023-08-01 12:00:00  30.4  89.0  29.4  87.0  29.4  69.0
    2023-08-01 18:00:00   NaN   NaN   NaN   NaN   NaN   NaN
    2023-08-02 06:00:00  29.7  93.0  29.0  88.0  28.5  70.0
    2023-08-02 12:00:00  29.5  93.0  28.8  88.0  28.4  70.0
    2023-08-02 18:00:00  31.7  93.0  30.3  88.0  30.3  69.0
    ...
    
    >>> out.index
    DatetimeIndex(['2023-08-01 06:00:00', '2023-08-01 12:00:00',
                   '2023-08-01 18:00:00', '2023-08-02 06:00:00',
                   '2023-08-02 12:00:00', '2023-08-02 18:00:00'],
                  dtype='datetime64[ns]', name='DateTime', freq=None)
    
    >>> out.columns
    MultiIndex([('S1', 'T'),
                ('S1', '%'),
                ('S2', 'T'),
                ('S2', '%'),
                ('S3', 'T'),
                ('S3', '%')],
               names=['Sensor', 'Metric'])
    

    Note: To correctly convert date, your locale must be English. As my locale is French, I have to switch to English like this:

    import locale
    locale.setlocale(locale.LC_TIME, 'en_US.utf8')  # or 'C'
    # 'en_US.utf8' is probably only valid on Linux and OSX, not Windows