I'm trying to help a non profit with some data analysis. Their spreadsheets look like this:
The 'S1' stand for sensor and there are two readings (temp & Humidity), which are taken three times a day (6 12 6). How do I make columns to get data for 'S1', like daily avg, weekly, avg, monthly avg for both temp and humidity? I might be in over my head on this one. I can use pandas for basic stuff but I can't figure out how to combine columns under one heading of 'S1'? Any help is greatly appreciated.
This is as far as i've got. Not sure I'm on the right path or not?
Your dataframe is not so bad after all. You have only one exception on lines (2, 3) vs 7, 11, 15, etc. Another minor problem is for hours: I assume hours are 6AM, 12PM and 6PM, so they need to be converted as 6, 12 and 18 in 24-Hour time format. If my assumptions are right and your screenshot contains no more exception, you can clean your data like this:
# Read data file without header/index and don't let pandas infers the dtype
df = pd.read_csv('TestData.csv', header=None, dtype=str)
m = df[1].notna() & df[0].isna()
# Create datetime index (rows)
dt = (pd.to_datetime(df[0].ffill() + ' ' + df[1], format='%d-%b-%y %H')
.rename('DateTime')[m])
dt[2::3] += pd.DateOffset(hours=12) # convert 6PM to 18
# Create multi-index headers (columns)
cols = df.loc[[0, 1]].ffill(axis=1).dropna(how='all', axis=1).T
mi = pd.MultiIndex.from_frame(cols, names=['Sensor', 'Metric'])
# Final output (remove two first columns)
out = df.loc[m, 2:].set_axis(dt, axis=0).set_axis(mi, axis=1).astype(float)
Output:
>>> out
Sensor S1 S2 S3
Metric T % T % T %
DateTime
2023-08-01 06:00:00 30.3 87.0 29.4 86.0 29.5 65.0
2023-08-01 12:00:00 30.4 89.0 29.4 87.0 29.4 69.0
2023-08-01 18:00:00 NaN NaN NaN NaN NaN NaN
2023-08-02 06:00:00 29.7 93.0 29.0 88.0 28.5 70.0
2023-08-02 12:00:00 29.5 93.0 28.8 88.0 28.4 70.0
2023-08-02 18:00:00 31.7 93.0 30.3 88.0 30.3 69.0
...
>>> out.index
DatetimeIndex(['2023-08-01 06:00:00', '2023-08-01 12:00:00',
'2023-08-01 18:00:00', '2023-08-02 06:00:00',
'2023-08-02 12:00:00', '2023-08-02 18:00:00'],
dtype='datetime64[ns]', name='DateTime', freq=None)
>>> out.columns
MultiIndex([('S1', 'T'),
('S1', '%'),
('S2', 'T'),
('S2', '%'),
('S3', 'T'),
('S3', '%')],
names=['Sensor', 'Metric'])
Note: To correctly convert date, your locale must be English. As my locale is French, I have to switch to English like this:
import locale
locale.setlocale(locale.LC_TIME, 'en_US.utf8') # or 'C'
# 'en_US.utf8' is probably only valid on Linux and OSX, not Windows