
Pandas - Generate sequences of consecutive rows (by timestamp) over grouped by values


I am working on an LSTM-based network for which I need to build, in Pandas, sequences of log occurrences grouped by another column's values, where each sequence is limited in length.

As a practical use case: I have multiple machines producing logs, and each log entry is marked with a title and a timestamp (for the sake of the example, t1 < t2 < t3 ...). The initial dataframe looks like this:

import pandas as pd

d = {'timestamp': ['t1', 't2', 't1', 't3', 't2', 't2', 't1'], 
     'machine': ['M1', 'M2', 'M2', 'M1', 'M2', 'M1', 'M3'], 
     'log': ['A', 'B', 'A', 'C', 'A', 'A', 'B']}

df = pd.DataFrame(d)
print(df.head(7))

  timestamp machine log
0        t1      M1   A
1        t2      M2   B
2        t1      M2   A
3        t3      M1   C
4        t2      M2   A
5        t2      M1   A
6        t1      M3   B

What I want to get is a dataframe with sequences of up to max_len = 2 entries for each machine. The desired output should look like this:

max_len = 2

  machine sequence
0      M1   [A, A]  # index from original df: [0, 5]
1      M1   [A, C]  # index from original df: [5, 3]
2      M2   [A, A]  # index from original df: [2, 4]
3      M2   [A, B]  # index from original df: [4, 1]
4      M3      [B]  # index from original df: [6]

The sequences are limited by max_len = 2 and their elements are ordered by timestamp.

max_len = 3

  machine   sequence
0      M1  [A, A, C]  # index from original df: [0, 5, 3]
1      M2  [A, A, B]  # index from original df: [2, 4, 1]
2      M3        [B]  # index from original df: [6]

The sequences are limited by max_len = 3 and their elements are ordered by timestamp.

Note: the max_len parameter is an upper bound on the sequence length; I will pad short sequences (like M3's) to fit the LSTM's input requirements.
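The padding mentioned in the note can be sketched in plain Python. This is illustrative only; the `PAD` token and the `pad_sequence` helper are hypothetical names, not part of any library:

```python
PAD = '<pad>'  # hypothetical padding token; use whatever value your embedding reserves

def pad_sequence(seq, max_len, pad=PAD):
    """Right-pad a short sequence with the pad token up to max_len elements."""
    return seq + [pad] * (max_len - len(seq))

print(pad_sequence(['B'], 3))  # ['B', '<pad>', '<pad>']
```

Sequences already at max_len pass through unchanged, so the helper can be applied uniformly to every group.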

Note 2: I am actually grouping by 2 columns, but to keep this example minimal I only included one.

What I tried so far:

I was using PySpark up until now, but I did it incorrectly by incrementally applying the F.lag function. That left many useless partial sequences from which I cannot identify the short sequences that need padding; this naive approach is also slow and generally not good.

from pyspark.sql import Window, functions as F

w = Window.partitionBy('machine').orderBy('timestamp')
for i in range(max_len):
    df = df.withColumn(f"log_lag_{i}", F.lag('log', i).over(w))

I would appreciate help with solving this in Pandas; I have been trying for a long time without success.

Thank you!


Solution

  • Sort by timestamp first, collect each machine's time-ordered logs, then explode sliding windows of up to max_len:

    max_len = 2

    def windows(logs, n):
        # one window per starting position; a single shorter window if the group is small
        if len(logs) < n:
            return [logs]
        return [logs[i:i + n] for i in range(len(logs) - n + 1)]

    out = (df.sort_values('timestamp', kind='mergesort')  # stable sort: ties keep row order
             .groupby('machine')['log']
             .agg(list)                     # collect each machine's time-ordered logs
             .apply(windows, n=max_len)     # split each list into windows of up to max_len
             .explode()                     # one row per window
             .rename('sequence')
             .reset_index())
    print(out)

      machine sequence
    0      M1   [A, A]
    1      M1   [A, C]
    2      M2   [A, B]
    3      M2   [B, A]
    4      M3      [B]

    Because the sort is stable, rows sharing a timestamp keep their original row order, which is why M2's two t2 entries come out in a different order than in the desired example; add a tie-breaking column to sort_values if the order of equal timestamps matters.
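Regarding Note 2 in the question: `groupby` accepts a list of keys, so the same collect-and-window pattern extends to two grouping columns unchanged. A minimal sketch, assuming a hypothetical second key column `datacenter` (the question does not name its real second column):

```python
import pandas as pd

df2 = pd.DataFrame({
    'timestamp': ['t1', 't2', 't1', 't2'],
    'machine': ['M1', 'M1', 'M1', 'M2'],
    'datacenter': ['DC1', 'DC1', 'DC2', 'DC1'],  # hypothetical second grouping column
    'log': ['A', 'C', 'B', 'A'],
})

# collect each (machine, datacenter) pair's time-ordered logs
seqs = (df2.sort_values('timestamp', kind='mergesort')
           .groupby(['machine', 'datacenter'])['log']
           .agg(list)
           .rename('sequence')
           .reset_index())
print(seqs)
```

From here the windowing and padding steps apply per group exactly as in the single-column case.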