I am working on an LSTM-based network where I need to model sequences of occurances over other column's values using Pandas
where each sequence must be limited by length.
A practical usecase for this, I have multiple machines with logs, the logs are marked with a title and a timestamp (for the sake of the example, t1 < t2 < t3 ...
), the initial dataframe looks like this:
d = {'timestamp': ['t1', 't2', 't1', 't3', 't2', 't2', 't1'],
'machine': ['M1', 'M2', 'M2', 'M1', 'M2', 'M1', 'M3'],
'log': ['A', 'B', 'A', 'C', 'A', 'A', 'B']}
df = pd.DataFrame(d)
print(df.head(7))
timestamp machine log
0 t1 M1 A
1 t2 M2 B
2 t1 M2 A
3 t3 M1 C
4 t2 M2 A
5 t2 M1 A
6 t1 M3 B
What I want to get is a dataframe with sequences of up to size max_len = 2
for each machine.
The desired output should look like this:
max_len = 2
machine sequence
0 M1 [A, A] # index from original df: [0, 5]
1 M1 [A, C] # index from original df: [5, 3]
2 M2 [A, A] # index from original df: [2, 4]
3 M2 [A, B] # index from original df: [4, 1]
4 M3 [B] # index from original df: [6]
The sequences are limited by max_len = 2
and their elements are ordered by timestamp
.
max_len = 3
machine sequence
0 M1 [A, A, C] # index from original df: [0, 5, 3]
1 M2 [A, A, B] # index from original df: [2, 4, 1]
2 M3 [B] # index from original df: [6]
The sequences are limited by max_len = 3
and their elements are ordered by timestamp
.
Note: max_len
parameter is an upper bound for the sequence length, I will pad short sequences (like M3
's) to fit LSTM requirements.
Note 2: I am actually grouping by 2 columns, but for this example to be as minimum as I can, I only included one column.
What I tried so far:
I was using PySpark
up until now but I did it incorrectly by incrementally using the F.lag
function. that left many useless partial sequences from which I cannot identify the short sequences that needs padding, also this naive approach is slow and basically not that good.
w = Window.repartition('machine').orderBy('timestamp')
for i in range(max_len):
df = df.withColumn(f"log_lag_{i}", F.lag('log', i-1).over(w))
I would appreciate help how to deal with this using Pandas
, I have been trying for a long time and failed.
Thank you!
Lets Try itertools
import itertools
df=(df.assign(log1=df.groupby('machine')['log'].apply(lambda x: list(sorted(i) for i in (itertools.combinations(x, 2))))# get sorted tuple combinations
.explode()# Explode them into rows
.reset_index(drop=True)#Drop index
.combine_first(df['log'])#Update the new column where there is a null value
.astype(str)#Convert the lists into string
).drop_duplicates(subset=['log','log1'])#drop duplicates
.drop('timestamp',1)#drop column
)
machine log log1
0 M1 A ['A', 'C']
1 M2 B ['A', 'A']
3 M1 C ['A', 'B']
4 M2 A ['A', 'B']
5 M1 A ['A', 'A']
6 M3 B B