python, machine-learning, time-series, knn, anomaly-detection

Multiple multivariate timeseries supervised anomalies prediction


I have multiple timeseries datasets. Each dataset represents a manufacturing process, has 36000 rows and 4 columns, and is labeled; some of them contain anomalies. Each dataset has:

  • A Timestamp index (values are recorded every second for 10 h); this index is essentially the same for every dataset.
  • 2 features: Flow and Pressure, which are partly correlated (I'd like to take this into account if possible).
  • 2 labels: Flow anomaly and Pressure anomaly. Each is 1 if the value is an anomaly and 0 otherwise.

I want to train a machine learning model to identify anomalies in other timeseries data of the same kind, and ideally to predict them as well. Candidate models include Isolation Forest, KNN, or a neural network, among others.

But I am having trouble handling multiple multivariate and multi-label timeseries data.

I tried Darts, a Python library made for this kind of problem, but I don't know how to train an Isolation Forest model on multivariate timeseries with it, and I can't find this in the documentation.

My data is stored in CSV files that I import as pandas dataframes. I use Python 3.11.2.


Solution

  • This is a very general question, and it is more a matter of researching which approach works best and how to apply it.

    First of all, Darts is great for time-series tasks, but it doesn't include Isolation Forest. scikit-learn, on the other hand, does, so you need to use both.

    Below is a toy example to illustrate this:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score, f1_score
    
    # fake data
    np.random.seed(0)
    
    # fake timestamps
    timestamps = pd.date_range(start='2023-08-01', periods=36000, freq='s')
    
    # fake Flow and Pressure
    data1 = pd.DataFrame({
        'Timestamp': timestamps,
        'Flow': np.random.normal(100, 10, 36000),
        'Pressure': np.random.normal(50, 5, 36000),
        'Flow anomaly': np.random.randint(0, 2, 36000),
        'Pressure anomaly': np.random.randint(0, 2, 36000),
    })
    
    data2 = pd.DataFrame({
        'Timestamp': timestamps,
        'Flow': np.random.normal(90, 8, 36000),
        'Pressure': np.random.normal(55, 6, 36000),
        'Flow anomaly': np.random.randint(0, 2, 36000),
        'Pressure anomaly': np.random.randint(0, 2, 36000),
    })
    
    data3 = pd.DataFrame({
        'Timestamp': timestamps,
        'Flow': np.random.normal(110, 12, 36000),
        'Pressure': np.random.normal(45, 4, 36000),
        'Flow anomaly': np.random.randint(0, 2, 36000),
        'Pressure anomaly': np.random.randint(0, 2, 36000),
    })
    
    # Concatenate all datasets
    combined_data = pd.concat([data1, data2, data3], ignore_index=True)
    combined_data['Timestamp'] = pd.to_datetime(combined_data['Timestamp'])
    combined_data.set_index('Timestamp', inplace=True)
    
    
    # split features and targets
    features = combined_data[['Flow', 'Pressure']]
    labels = combined_data[['Flow anomaly', 'Pressure anomaly']]
    
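    # Split into train and test sets (train_test_split shuffles rows by default,
    # so time order is ignored; acceptable only for this toy example)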
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)
    
    # Scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
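    # model fitting (IsolationForest is unsupervised: the labels are not used here,
    # and both models are fit on the same two scaled features)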
    isolation_forest_flow = IsolationForest()
    isolation_forest_pressure = IsolationForest()
    
    isolation_forest_flow.fit(X_train_scaled)
    isolation_forest_pressure.fit(X_train_scaled)
    
    # predict the test set
    pred_flow = isolation_forest_flow.predict(X_test_scaled)
    pred_pressure = isolation_forest_pressure.predict(X_test_scaled)
    
    # predictions back to labels (0 == inliers, 1 == anomalies)
    pred_flow_labels = np.where(pred_flow == -1, 1, 0)
    pred_pressure_labels = np.where(pred_pressure == -1, 1, 0)
    
    # accuracy
    accuracy_flow = accuracy_score(y_test['Flow anomaly'], pred_flow_labels)
    accuracy_pressure = accuracy_score(y_test['Pressure anomaly'], pred_pressure_labels)
    
    
    print(f"Flow Accuracy: {accuracy_flow}")
    print(f"Pressure Accuracy: {accuracy_pressure}")
    

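    The above prints:

    Flow Accuracy: 0.4999537037037037
    Pressure Accuracy: 0.4988888888888889

    Accuracy is close to 0.5 here only because the fake labels are random and unrelated to the features.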
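
Since the datasets are labeled, a supervised classifier is another option alongside Isolation Forest, which never looks at the labels. The following is a minimal sketch rather than a definitive recipe. It assumes KNN (one of the models named in the question), which in scikit-learn accepts a two-column label matrix directly. The helper names `make_run` and `add_window_features`, the 30-second window, the injected spikes, and the Pressure/Flow ratio feature are all illustrative assumptions, not part of the original data. A whole run is held out for testing so that no rows from the test run leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LABELS = ["Flow anomaly", "Pressure anomaly"]


def make_run(seed, n=3000):
    """Hypothetical stand-in for one CSV file: correlated signals plus injected spikes."""
    rng = np.random.default_rng(seed)
    flow = 100 + rng.normal(0, 2, n)
    pressure = 0.5 * flow + rng.normal(0, 1, n)  # partly correlated with Flow
    flow_anom = np.zeros(n, dtype=int)
    pressure_anom = np.zeros(n, dtype=int)
    # Inject spikes into about 2% of the rows of each signal and label them
    flow_idx = rng.choice(n, size=n // 50, replace=False)
    pressure_idx = rng.choice(n, size=n // 50, replace=False)
    flow[flow_idx] += 15
    flow_anom[flow_idx] = 1
    pressure[pressure_idx] += 10
    pressure_anom[pressure_idx] = 1
    return pd.DataFrame(
        {
            "Flow": flow,
            "Pressure": pressure,
            "Flow anomaly": flow_anom,
            "Pressure anomaly": pressure_anom,
        },
        index=pd.date_range("2023-08-01", periods=n, freq="s"),
    )


def add_window_features(df, window=30):
    """Add trailing-window context and a feature linking Flow and Pressure."""
    out = df[["Flow", "Pressure"]].copy()
    for col in ["Flow", "Pressure"]:
        # Deviation from the median of the current and previous rows (no future data)
        out[f"{col}_dev"] = df[col] - df[col].rolling(window, min_periods=1).median()
    out["ratio"] = df["Pressure"] / df["Flow"]
    return out


runs = [make_run(seed) for seed in range(3)]

# Build features per run so rolling windows never cross run boundaries,
# then train on the first two runs and test on the third.
X_train = pd.concat([add_window_features(r) for r in runs[:2]])
y_train = pd.concat([r[LABELS] for r in runs[:2]])
X_test = add_window_features(runs[2])
y_test = runs[2][LABELS]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
pred = model.predict(X_test)  # array of shape (n_rows, 2), one column per label

# F1 is more informative than accuracy when anomalies are rare
scores = {label: f1_score(y_test[label], pred[:, i]) for i, label in enumerate(LABELS)}
for label, score in scores.items():
    print(f"{label} F1: {score:.3f}")
```

With real data, `make_run(seed)` would be replaced by reading each CSV file, for example with `pd.read_csv(path, index_col='Timestamp', parse_dates=True)`, where `path` and the column name are assumptions about how the files are laid out.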