Tags: pandas, machine-learning, time-series, data-science, artificial-intelligence

How to resample a multidimensional event without losing information


I have a dataset representing a time series. The time series has 165 events.

Each unique event has 15 rows: FX1, FX2, … FX15.

The time between consecutive rows is 1 minute, so one event takes 15 minutes.

Each row has 6 features: F1, F2, F3, T1, T2, T3. All values are integers.

Here is an example of an event:

This event started at 2000/01/01 00:00:00 and ended at 2000/01/01 00:15:00.

The first row in the event has F1=3, F2=-4, … T3=45.


                   time  F1   F2   F3  T1  T2   T3
0   2000/01/01 00:01:00  94  -76    0  47   9  -20
1   2000/01/01 00:02:00  -2   85   14  79  92  -95
2   2000/01/01 00:03:00  -3   13 -100  33  74  -43
3   2000/01/01 00:04:00  39   64  -29  32 -73  -44
4   2000/01/01 00:05:00  80   44    3  73  56 -100
5   2000/01/01 00:06:00 -19  -51  -77  32  72   24
6   2000/01/01 00:07:00  79  -69  -87   4  20   19
7   2000/01/01 00:08:00  68    6   95 -76  34   58
8   2000/01/01 00:09:00  26  -59   24  79 -43   48
9   2000/01/01 00:10:00  71    8  -85 -15 -45  -56
10  2000/01/01 00:11:00  51   98    6 -53 -39    5
11  2000/01/01 00:12:00  99   73  -48  -1  64   56
12  2000/01/01 00:13:00 -12   13  -63  51  36   95
13  2000/01/01 00:14:00   8 -100   54  91 -56  -32

My question is this:

Can I convert one event to just one row without losing information?

Let's say I use the pandas DataFrame.resample() method for downsampling.

I think resampling the dataset to 15 minutes will produce a single row per event, but will the resulting one-dimensional data be missing any information?

So the desired output for the above data frame would be something like this:

                 time  F1   F2   F3   T1   T2     T3
 2000/01/01 00:01:00  XF1  XF2  XF3  XT1  XT2    XT3

Where XF1 is the value that best represents the array of 14 values that F1 took in rows 0 to 13 (94, -2, -3, 39, 80, -19, 79, ..., 8).


Solution

  • With the following toy dataframe:

    import random
    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(
        {"time": [f"2000/01/01 00:{i:02}:00" for i in range(1, 60)]}
        | {
            key: [random.randrange(-100, 100) for _ in range(1, 60)]
            for key in ("F1", "F2", "F3", "T1", "T2", "T3")
        }
    )
    
    print(df.head(30))
    
    # Output
                       time  F1  F2  F3  T1  T2   T3
    0   2000/01/01 00:01:00  44 -91   5 -23  97   14
    1   2000/01/01 00:02:00 -38  46 -14  -5 -66   39
    2   2000/01/01 00:03:00  70 -63 -28 -53  53   77
    3   2000/01/01 00:04:00  33 -16  82  98  54   95
    4   2000/01/01 00:05:00 -51 -89 -52 -88 -68  -61
    5   2000/01/01 00:06:00 -64  69  25 -98  21   63
    6   2000/01/01 00:07:00 -52  51 -34  35 -47   83
    7   2000/01/01 00:08:00 -10  10 -87 -49  75    7
    8   2000/01/01 00:09:00 -51 -95  25 -49 -43  -13
    9   2000/01/01 00:10:00 -16  88 -23  -3 -17   71
    10  2000/01/01 00:11:00   4 -97   3  53 -35  -83
    11  2000/01/01 00:12:00 -94 -17 -88  -5  41   60
    12  2000/01/01 00:13:00  91 -14  43  79  -8   14
    13  2000/01/01 00:14:00  94  -1 -57   7 -21   91
    14  2000/01/01 00:15:00 -60  -2  39 -56 -61   24
    15  2000/01/01 00:16:00 -20 -83  30  68 -97  -87
    16  2000/01/01 00:17:00   7  70 -65  49  13  -66
    17  2000/01/01 00:18:00  29 -70  78  84 -80   -5
    18  2000/01/01 00:19:00  57 -57 -78 -75  29  -12
    19  2000/01/01 00:20:00  -1 -48 -91  89  25   88
    20  2000/01/01 00:21:00 -60 -90   6  34 -77   34
    21  2000/01/01 00:22:00 -28   7 -33 -64  42   56
    22  2000/01/01 00:23:00 -29  85  45  29 -20  -38
    23  2000/01/01 00:24:00  40 -26  17  18  50 -100
    24  2000/01/01 00:25:00 -74  60 -50  -3  81  -91
    25  2000/01/01 00:26:00  35  47 -90  19  48  -47
    26  2000/01/01 00:27:00 -32  34 -43  33  26   26
    27  2000/01/01 00:28:00  74  12 -11 -97 -20  -29
    28  2000/01/01 00:29:00  58 -90  -7 -88  29  -89
    29  2000/01/01 00:30:00  39 -51 -88 -94 -26  -27
    

    Here is one way to do it:

    df["time"] = pd.to_datetime(df["time"], format="%Y/%m/%d %H:%M:%S")
    new_df = (
        df.set_index("time")
        .resample("15min")  # 15-minute bins ("15T" is a deprecated alias in recent pandas)
        .agg(lambda x: int(np.mean(x)))  # mean of each bin, truncated toward zero
    )
    

    Then:

    print(new_df)
    # Output
    
                         F1  F2  F3  T1  T2  T3
    time
    2000-01-01 00:00:00  -2 -15 -14  -7   2  32
    2000-01-01 00:15:00   0 -10 -16   2   0 -22
    2000-01-01 00:30:00   7  -3 -13 -12   0  24
    2000-01-01 00:45:00  -1   0  -5 -23 -12 -23
    

    Without more context, it's impossible to say which statistic best represents the resampled values, so I chose the mean (truncated to an integer), but you can replace the lambda function with anything more suitable.
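
As an illustration of swapping in something other than a single mean, here is a minimal sketch that keeps several summary statistics per feature. The seeded toy data and the choice of statistics (mean, std, min, max) are assumptions made for the example, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 30 one-minute rows, seeded so the run is reproducible.
rng = np.random.default_rng(0)
times = pd.date_range("2000-01-01 00:00:00", periods=30, freq="min")
df = pd.DataFrame(
    rng.integers(-100, 100, size=(30, 6)),
    index=times,
    columns=["F1", "F2", "F3", "T1", "T2", "T3"],
)

# Compute several statistics per feature in each 15-minute bin.
summary = df.resample("15min").agg(["mean", "std", "min", "max"])

# Flatten the (feature, statistic) column MultiIndex into names like "F1_mean".
summary.columns = [f"{feature}_{stat}" for feature, stat in summary.columns]

print(summary.shape)  # 2 bins x (6 features * 4 statistics)
```

Each 15-minute bin then becomes one row with 24 columns, which retains more of the distribution of each feature than the mean alone, though the order of values within the bin is still lost.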

    Also, I don't think you will be able to avoid losing some information, as resampling/aggregating necessarily comes at a cost.
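
If keeping every value matters more than getting a compact row, one alternative to aggregating is to reshape each event into a single wide row, with one column per (feature, step) pair. The sketch below assumes that events are contiguous and that each has exactly 15 rows; the `event` and `step` helper columns and the toy data are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 events of 15 one-minute rows each.
rng = np.random.default_rng(0)
n_events, rows_per_event = 3, 15
features = ["F1", "F2", "F3", "T1", "T2", "T3"]
times = pd.date_range(
    "2000-01-01 00:00:00", periods=n_events * rows_per_event, freq="min"
)
df = pd.DataFrame(
    rng.integers(-100, 100, size=(len(times), len(features))), columns=features
)
df.insert(0, "time", times)

# Assumption: rows are ordered and every event has exactly 15 consecutive rows.
df["event"] = np.arange(len(df)) // rows_per_event    # event number: 0, 1, 2
df["step"] = np.arange(len(df)) % rows_per_event + 1  # position in event: 1..15

# One row per event, one column per (feature, step) pair: 6 * 15 = 90 columns.
wide = df.pivot(index="event", columns="step", values=features)
wide.columns = [f"{feature}_{step}" for feature, step in wide.columns]

# Keep the start time of each event as the first column.
wide.insert(0, "time", df.groupby("event")["time"].first())

print(wide.shape)
```

Nothing is discarded here, since every original value ends up in a column such as `F1_1` or `T3_15`, but each event row grows to 90 value columns, so this trades compactness for completeness.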