I have a dataset representing a time series. The time series has 165 events.
Each unique event has 15 rows: FX1, FX2, …, FX15.
The time between consecutive rows is 1 minute, so one event spans 15 minutes.
Each row has 6 features: F1, F2, F3, T1, T2, T3. All values are integers.
Here is an example of an event:
This event started at 2000/01/01 00:00:00 and ended at 2000/01/01 00:15:00.
The first row in the event has F1=94, F2=-76, …, T3=-20.
time F1 F2 F3 T1 T2 T3
0 2000/01/01 00:01:00 94 -76 0 47 9 -20
1 2000/01/01 00:02:00 -2 85 14 79 92 -95
2 2000/01/01 00:03:00 -3 13 -100 33 74 -43
3 2000/01/01 00:04:00 39 64 -29 32 -73 -44
4 2000/01/01 00:05:00 80 44 3 73 56 -100
5 2000/01/01 00:06:00 -19 -51 -77 32 72 24
6 2000/01/01 00:07:00 79 -69 -87 4 20 19
7 2000/01/01 00:08:00 68 6 95 -76 34 58
8 2000/01/01 00:09:00 26 -59 24 79 -43 48
9 2000/01/01 00:10:00 71 8 -85 -15 -45 -56
10 2000/01/01 00:11:00 51 98 6 -53 -39 5
11 2000/01/01 00:12:00 99 73 -48 -1 64 56
12 2000/01/01 00:13:00 -12 13 -63 51 36 95
13 2000/01/01 00:14:00 8 -100 54 91 -56 -32
My question is this:
Can I convert one event to just one row without losing information?
Let's say I use the pandas DataFrame resample() method for downsampling.
Think of it like resampling a dataset: in this case, resampling to 15 minutes should produce a single row per event, I think, but will the resulting one-dimensional data lose any information?
So the desired output for the above data frame would be something like this:
time F1 F2 F3 T1 T2 T3
2000/01/01 00:01:00 XF1 XF2 XF3 XT1 XT2 XT3
Where XF1 is the value that best represents the 14 values that F1 took in rows 0 to 13 (94, -2, -3, 39, 80, -19, 79, …, 8).
With the following toy dataframe:
import random

import numpy as np
import pandas as pd

# Toy DataFrame: one row per minute with six random integer feature columns
# (the dict union operator "|" requires Python 3.9+)
df = pd.DataFrame(
    {"time": [f"2000/01/01 00:{i:02}:00" for i in range(1, 60)]}
    | {
        key: [random.randrange(-100, 100) for _ in range(1, 60)]
        for key in ("F1", "F2", "F3", "T1", "T2", "T3")
    }
)
print(df.head(30))
# Output
time F1 F2 F3 T1 T2 T3
0 2000/01/01 00:01:00 44 -91 5 -23 97 14
1 2000/01/01 00:02:00 -38 46 -14 -5 -66 39
2 2000/01/01 00:03:00 70 -63 -28 -53 53 77
3 2000/01/01 00:04:00 33 -16 82 98 54 95
4 2000/01/01 00:05:00 -51 -89 -52 -88 -68 -61
5 2000/01/01 00:06:00 -64 69 25 -98 21 63
6 2000/01/01 00:07:00 -52 51 -34 35 -47 83
7 2000/01/01 00:08:00 -10 10 -87 -49 75 7
8 2000/01/01 00:09:00 -51 -95 25 -49 -43 -13
9 2000/01/01 00:10:00 -16 88 -23 -3 -17 71
10 2000/01/01 00:11:00 4 -97 3 53 -35 -83
11 2000/01/01 00:12:00 -94 -17 -88 -5 41 60
12 2000/01/01 00:13:00 91 -14 43 79 -8 14
13 2000/01/01 00:14:00 94 -1 -57 7 -21 91
14 2000/01/01 00:15:00 -60 -2 39 -56 -61 24
15 2000/01/01 00:16:00 -20 -83 30 68 -97 -87
16 2000/01/01 00:17:00 7 70 -65 49 13 -66
17 2000/01/01 00:18:00 29 -70 78 84 -80 -5
18 2000/01/01 00:19:00 57 -57 -78 -75 29 -12
19 2000/01/01 00:20:00 -1 -48 -91 89 25 88
20 2000/01/01 00:21:00 -60 -90 6 34 -77 34
21 2000/01/01 00:22:00 -28 7 -33 -64 42 56
22 2000/01/01 00:23:00 -29 85 45 29 -20 -38
23 2000/01/01 00:24:00 40 -26 17 18 50 -100
24 2000/01/01 00:25:00 -74 60 -50 -3 81 -91
25 2000/01/01 00:26:00 35 47 -90 19 48 -47
26 2000/01/01 00:27:00 -32 34 -43 33 26 26
27 2000/01/01 00:28:00 74 12 -11 -97 -20 -29
28 2000/01/01 00:29:00 58 -90 -7 -88 29 -89
29 2000/01/01 00:30:00 39 -51 -88 -94 -26 -27
Here is one way to do it:
df["time"] = pd.to_datetime(df["time"], format="%Y/%m/%d %H:%M:%S")
new_df = (
df.set_index("time").resample("15T").agg(lambda x: int(np.mean(x)))
) # 15 min. resampling
Then:
print(new_df)
# Output
F1 F2 F3 T1 T2 T3
time
2000-01-01 00:00:00 -2 -15 -14 -7 2 32
2000-01-01 00:15:00 0 -10 -16 2 0 -22
2000-01-01 00:30:00 7 -3 -13 -12 0 24
2000-01-01 00:45:00 -1 0 -5 -23 -12 -23
Without more context, it's impossible to determine what the best representation of the resampled values would be, so I chose the mean, but you can replace the lambda function with anything more suitable.
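For instance, here is a sketch of a couple of alternatives, reusing the same df as above; the particular statistics (median, min, max) are chosen purely for illustration:

# Keep several summary statistics per 15-minute bin instead of a single mean
# (this produces MultiIndex columns: one sub-column per statistic).
summary = (
    df.set_index("time")
    .resample("15T")
    .agg(["mean", "median", "min", "max"])
)
print(summary.head())

# Or use a single, more robust statistic such as the median:
median_df = df.set_index("time").resample("15T").median()
print(median_df.head())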
Also, I don't think you will be able to avoid losing some information, as resampling/aggregating necessarily comes at a cost.
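To make that concrete, here is a minimal illustration with made-up numbers (not taken from your data): two different series can collapse to the same aggregate, so the original rows cannot be recovered from the single summary value.

import numpy as np

# Two different 15-value series with the same mean: the aggregate cannot tell them apart.
a = np.array([10, -10, 10, -10, 10, -10, 10, -10, 10, -10, 10, -10, 10, -10, 0])
b = np.zeros(15, dtype=int)
print(a.mean(), b.mean())  # both print 0.0, yet the series are clearly different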