Search code examples
pythonnumpyanomaly-detectionrandom-data

How would I generate a random data series in Python with events?


I'm trying to generate a random data series (or a time series) for anomaly detection, with events spanning a few consecutive data points. They could be values above/below a certain threshold, or anomaly types with different known probabilities.

e.g. in a case where 1 is normal and event types are within [2, 3, 4]: 11112221113333111111112211111

I looked through the np.random and random methods, but couldn't find any that generate these events. My current solution is picking random points, adding random durations to them to generate event start and end positions, labeling each event with a random event type, and joining back to the dataset, something like:

import numpy as np
num_events = np.random.randint(1, 10)
number_series = [1]*60
first_pos = 0 
event_starts = sorted([first_pos + i for i in np.random.randint(50, size = num_events)])
event_ends = [sum(i) for i in list(zip(event_starts, np.random.randint(8, size = num_events)))]
for c in list(zip(event_starts, event_ends)):
    rand_event_type  = np.random.choice(a = [2, 3, 4], p = [0.5, 0.3, 0.2])
    number_series[c[0]:c[1]] = [rand_event_type]*len(number_series[c[0]:c[1]])
print(number_series)

[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

But I'm wondering if there is a simpler way to just generate a series of numbers with events, based on a set of probabilities.


Solution

  • It all depends on how you model your process (the underlying process you want to simulate). You can read more about some of the usual models on Wikipedia.

    Simplest

    In the following, we use a very simple model (slightly different than yours): events each have a probability (like in your question) and a random duration that is independent of the event itself. 1 ("normal") is an event like any others (unlike your sample code). We could change that, but right now this is one of the simplest models you can think of.

    def gen_events(n):
        events = np.random.choice(a=[1, 2, 3, 4], p=[0.6, 0.2, 0.12, 0.08], size=n)
        durations = np.random.randint(1, 8, size=n)
        return np.repeat(events, durations)
    
    np.random.seed(0)  # repeatable example
    number_series = gen_events(10)  # for example
    
    >>> number_series
    array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
           1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 4, 4, 1, 1, 1, 1, 1])
    

    Note, this is very fast:

    %timeit gen_events(1_000_000)
    # 44.9 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Markov chain

    Another model (easier to parameterize, a bit more complex to implement) would be a Markov model. The simplest of them would be a Markov chain. Here is a super simple (but not very efficient) version:

    def markov_chain(P, n, initial_state=0):
        m = P.shape[0]
        ix = np.arange(m)
        s = np.empty(n, dtype=int)
        s[0] = initial_state
        for i in range(1, n):
            s[i] = np.random.choice(ix, p=P[s[i-1]])
        return s
    

    Above, P is a transition matrix, where each cell P[i,j] is the probability to transition from state i to state j. Here is an example application:

    P = np.array([
        [.7, .1, .12, .08],  # from 0 to others
        [.3, .6, .05, .05],
        [.3, 0, .65, .05],
        [.4, 0, .05, .55],
    ])
    
    np.random.seed(0)
    n = 100
    s = markov_chain(P, n) + 1
    >>> s
    array([1, 1, 2, 2, 2, 2, 2, 2, 2, 4, 1, 2, 2, 2, 3, 1, 1, 1, 3, 3, 3, 4,
           4, 4, 4, 1, 1, 1, 4, 4, 3, 1, 2, 2, 2, 1, 1, 1, 1, 4, 4, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 3, 1, 3, 1, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 4, 1, 1, 1, 2, 1, 1, 1, 1, 3])
    

    Note that the unigram probability of each event is called pi and corresponds to any of the rows of lim_{k -> \infty} P**k:

    >>> pd.Series(markov_chain(P, 1000, 0)).value_counts(normalize=True).sort_index()
    0    0.530
    1    0.135
    2    0.209
    3    0.126
    
    >>> np.linalg.matrix_power(P, 40)[0]
    array([0.52188552, 0.13047138, 0.21632997, 0.13131313])