Tags: python, machine-learning, data-science, sequence, sequential

How to simultaneously mine sequential patterns and infer variables from a set of sequences?


Given a set of sequences, I'm trying to extract a single sequence that captures features common to all sequences in the set.

I'm expressing my sequences as ordered lists of events, each event represented by a multiset or bag of event features.

Example of a sequence:

[{'a', 'y'}, {'H', 'b'}, {'C', 'F', 'c'}]

Example of an input set of sequences:

[
[{'a', 'y'}, {'H', 'b'}, {'x'}, {'C', 'F', 'c'}],
[{'a', 'd'}, {'b', 's'}, {'l'}, {'F', 'c', 'k'}],
[{'J', 'a'}, {'H', 'b'}, {'b'}, {'c', 'c', 'm'}],
[{'W', 'a'}, {'H', 'b'}, {'m'}, {'K', 'c', 'z'}],
[{'a', 'h', 'm'}, {'L', 'b', 'y'}, {'b', 'x'}, {'H', 'c', 'g'}]
]

If I were to input the above set of sequences, I would be expecting an output of

[{'a', '*'}, {'b', '*'}, {'*'}, {'c', '*', '*'}]

Meaning that the first event of every sequence in the set contains an 'a', as well as one other variable character (represented here as the character '*', but the specific representation of variables doesn't matter). The second event of every sequence contains a 'b', as well as one other variable character. Between the second and last events, there always exists an event with at least one character. The last event always contains a 'c', as well as at least two other characters.
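
To make this concrete, here is a minimal sketch (my own illustration, not part of the original question) of what matching such a pattern could mean, assuming events align positionally and each '*' stands for exactly one arbitrary extra feature:

from collections import Counter

# Hypothetical checker: assumes events align positionally and each '*'
# wildcard stands for exactly one arbitrary extra feature.
def event_matches(event, pattern_event):
    literals = Counter(f for f in pattern_event if f != '*')
    wildcards = pattern_event.count('*')
    if literals - Counter(event):      # a required literal is missing
        return False
    extra = Counter(event) - literals  # features beyond the literals
    return sum(extra.values()) >= wildcards

def sequence_matches(sequence, pattern):
    return (len(sequence) == len(pattern)
            and all(event_matches(e, p) for e, p in zip(sequence, pattern)))

pattern = [['a', '*'], ['b', '*'], ['*'], ['c', '*', '*']]
print(sequence_matches([('a', 'y'), ('H', 'b'), ('x',), ('C', 'F', 'c')], pattern))
# True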

Sequential Pattern Mining algorithms such as BIDE, PrefixSpan, and Apriori can extract patterns from sequences of immutable events, but they do not seem to capture the desired event patterns from the example above. I've tried PrefixSpan (the prefixspan Python package) and SPADE (the pycspade Python package) and didn't get the desired output for my example input.

Here's an example with prefixspan:

from prefixspan import PrefixSpan

# Each sequence is a list of tuples in this case,
# as the function requires a hashable type for the events. 
db = [[('y', 'a'), ('H', 'b'), ('x',), ('F', 'C', 'c')],
      [('a', 'd'), ('b', 's'), ('l',), ('F', 'k', 'c')],
      [('J', 'a'), ('H', 'b'), ('b',), ('m', 'c')],
      [('a', 'W'), ('H', 'b'), ('m',), ('z', 'K', 'c')],
      [('h', 'm', 'a'), ('b', 'y', 'L'), ('x', 'b'), ('g', 'H', 'c')]]

# Initialize PrefixSpan object.
ps = PrefixSpan(db)

# Get closed sequential patterns with a support of 5
# (the number of sequences).
print(ps.frequent(5, closed=True))

>>[]

If I reduce the required support to 2, I get:

print(ps.frequent(2, closed=True))
>>[(3, [('H', 'b')])]

Solution

  • You don't need a fancy algorithm here. What you need, believe it or not, is count vectorizing followed by min().

    Count vectorizing replaces a sequence of words with a vector of counts: each position in the vector corresponds to a particular word, and the value at that position says how many times that word appears. Here we're using letters instead of words, but the concept is the same.

    For example, take the first event from each row.

    ('a', 'y')
    ('a', 'd')
    ('J', 'a')
    ('W', 'a')
    ('a', 'h', 'm')
    

    Next, write a line containing each unique word. For each row, count how many times each word appears.

    a y d J W h m
    1 1 0 0 0 0 0
    1 0 1 0 0 0 0
    1 0 0 1 0 0 0
    1 0 0 0 1 0 0
    1 0 0 0 0 1 1
    

    Next, the min() part. Take the minimum value of each column. a is the only column which has a minimum larger than zero. In other words, an a appears in every row.
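
    As a quick check, here's that table as a NumPy array with the column-wise min() (my own addition; the columns follow the order a y d J W h m from above):

    import numpy as np

    # Count matrix for the first events; columns: a y d J W h m.
    counts = np.array([
        [1, 1, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0, 0],
        [1, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 1, 1],
    ])
    print(counts.min(axis=0))  # [1 0 0 0 0 0 0] -> only 'a' is in every row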

    Here's the code to implement this. The core of the algorithm is in calculate_common(). Everything else is just converting in and out of the count-vectorized format. You could also use CountVectorizer from scikit-learn, which would be less code (see the sketch at the end of this answer), but I chose to do it explicitly to show how it works internally.

    import itertools
    import numpy as np
    
    
    dataset = [
        # Note the trailing commas: ('x',) is a one-element tuple,
        # while ('x') would just be the string 'x'.
        [('a', 'y'), ('H', 'b'), ('x',), ('C', 'F', 'c')],
        [('a', 'd'), ('b', 's'), ('l',), ('F', 'c', 'k')],
        [('J', 'a'), ('H', 'b'), ('b',), ('c', 'c', 'm')],
        [('W', 'a'), ('H', 'b'), ('m',), ('K', 'c', 'z')],
        [('a', 'h', 'm'), ('L', 'b', 'y'), ('b', 'x'), ('H', 'c', 'g')]
    ]
    
    
    # Build the vocabulary: every distinct letter in the dataset,
    # mapped to an integer index (and back).
    unique_letters = sorted(set(itertools.chain(*itertools.chain(*dataset))))
    vocabulary = {val: i for i, val in enumerate(unique_letters)}
    vocabulary_inverse = {i: val for val, i in vocabulary.items()}
    
    
    def convert_letters_to_vocab_index(row):
        # Replace each letter with its integer index in the vocabulary.
        return [[vocabulary[letter] for letter in event] for event in row]
    
    
    def sparse_to_dense(dataset_i):
        # Count-vectorize: build a (row, event, vocab word) array of counts.
        # Assumes every row has the same number of events.
        n_rows = len(dataset_i)
        n_events = len(dataset_i[0])
        n_words_in_vocab = len(vocabulary)
        ret = np.zeros((n_rows, n_events, n_words_in_vocab), dtype='int')
        for i in range(n_rows):
            for j in range(n_events):
                event = dataset_i[i][j]
                for vocab_index in event:
                    ret[i, j, vocab_index] += 1
        return ret
    
    
    def calculate_common(dataset_dense):
        # Column-wise minimum: a letter is common if it appears in the
        # corresponding event of every row.
        return np.min(dataset_dense, axis=0)
    
    
    def common_letters_dense_to_sparse(common_letters_dense):
        # Convert nonzero counts back into lists of letters.
        ret = []
        for event_dense in common_letters_dense:
            event_match = []
            for vocab_index in event_dense.nonzero()[0]:
                event_match.append(vocabulary_inverse[vocab_index])
            ret.append(event_match)
        return ret
    
    
    def calculate_minimum_length(dataset_dense):
        # What are the minimum number of letters in each event?
        return dataset_dense.sum(axis=2).min(axis=0)
    
    
    def pad_to_min(common_letters, minimum_length):
        # Pad each event with '*' until it reaches the minimum observed length.
        for event, length in zip(common_letters, minimum_length):
            while len(event) < length:
                event.append('*')
        return common_letters
    
    
    dataset_i = [convert_letters_to_vocab_index(row) for row in dataset]
    dataset_dense = sparse_to_dense(dataset_i)
    minimum_length = calculate_minimum_length(dataset_dense)
    common_letters_dense = calculate_common(dataset_dense)
    common_letters = common_letters_dense_to_sparse(common_letters_dense)
    print(pad_to_min(common_letters, minimum_length))
    

    Produces this output:

    [['a', '*'], ['b', '*'], ['*'], ['c', '*', '*']]
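
    For comparison, here's a sketch of the CountVectorizer variant mentioned above. How it's wired up here (the callable analyzer, the per-position grouping, the '*' padding) is my own assumption rather than the answer's code; it reuses the dataset defined above:

    from sklearn.feature_extraction.text import CountVectorizer

    # A callable analyzer makes CountVectorizer treat each event as
    # already tokenized, skipping its string preprocessing.
    vectorizer = CountVectorizer(analyzer=lambda event: event)

    result = []
    for events in zip(*dataset):  # group the i-th events of every row
        counts = vectorizer.fit_transform(events).toarray()
        common = counts.min(axis=0)
        names = vectorizer.get_feature_names_out()
        letters = [str(names[i]) for i in common.nonzero()[0]]
        # Pad with '*' up to the shortest event length at this position.
        letters += ['*'] * (counts.sum(axis=1).min() - len(letters))
        result.append(letters)

    print(result)  # [['a', '*'], ['b', '*'], ['*'], ['c', '*', '*']]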