Tags: python, machine-learning, anomaly-detection

What algorithms and libraries to use to process sensor data


Sorry for cross-posting; I have not received an answer on Cross Validated.

I am at the very beginning of data science. I have data from 20 sensors, and almost all of the time the values are "good". Sometimes I find that something is wrong. Right now I have 500,000 rows, each with 20 columns, and about 300 of those rows are "bad". These "bad" rows can represent different kinds of errors, and some of them have missing values. I do not know how many types of error I will have.

Since I do not have enough "bad" examples, I cannot train a supervised neural network directly.

My intention is to use an outlier/anomaly detection algorithm, cluster the detected anomalies, and manually assign an error type to each cluster.

What algorithms and python libraries can you recommend? Any help will be appreciated.


Solution

  • This is a common problem in outlier and anomaly detection, and there are several established strategies for this kind of analysis.

    Autoencoders: check out this post on using autoencoders for fraud detection https://medium.com/@curiousily/credit-card-fraud-detection-using-autoencoders-in-keras-tensorflow-for-hackers-part-vii-20e0c85301bd

    And this repo: https://github.com/chen0040/keras-anomaly-detection

    My best shot at paraphrasing how this approach works: an autoencoder compresses each input down to its fundamentals and then reconstructs it; inputs that come back looking very different from the original (a high reconstruction error) are fundamentally different from normal and get flagged as anomalies.
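    A minimal sketch of that idea, assuming the 20 sensor readings per row are already scaled to comparable ranges; this is not the code from the linked post, and the layer sizes, training settings, and 99.9% threshold are illustrative assumptions:

        import numpy as np
        from tensorflow import keras
        from tensorflow.keras import layers

        n_features = 20  # one column per sensor

        # Dense autoencoder: squeeze each row through a small bottleneck and rebuild it.
        autoencoder = keras.Sequential([
            layers.Input(shape=(n_features,)),
            layers.Dense(10, activation="relu"),
            layers.Dense(4, activation="relu"),    # bottleneck: "only the fundamentals"
            layers.Dense(10, activation="relu"),
            layers.Dense(n_features),              # reconstruct the 20 sensor values
        ])
        autoencoder.compile(optimizer="adam", loss="mse")

        # Train only on rows known to be "good" (X_good has shape (n_rows, 20)).
        # autoencoder.fit(X_good, X_good, epochs=20, batch_size=256, validation_split=0.1)

        # Score every row by reconstruction error; the worst-reconstructed rows
        # are the candidates for "bad" rows.
        # errors = np.mean((autoencoder.predict(X_all) - X_all) ** 2, axis=1)
        # suspects = np.where(errors > np.quantile(errors, 0.999))[0]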

    Here's an approach that leverages the LSTM, a popular kind of "memory" cell used in recurrent neural networks: https://developer.ibm.com/tutorials/iot-deep-learning-anomaly-detection-5/
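    A rough sketch of that idea (again, not the tutorial's code): an LSTM autoencoder reconstructs short windows of readings, so errors that only show up as odd sequences over time are also caught. The window length and layer sizes are assumptions:

        from tensorflow import keras
        from tensorflow.keras import layers

        window, n_sensors = 30, 20  # assumed window length; one channel per sensor

        # LSTM autoencoder: encode a window of readings, then decode it back.
        lstm_ae = keras.Sequential([
            layers.Input(shape=(window, n_sensors)),
            layers.LSTM(32),                          # summarize the whole window
            layers.RepeatVector(window),              # repeat the summary per time step
            layers.LSTM(32, return_sequences=True),   # decode into a sequence again
            layers.TimeDistributed(layers.Dense(n_sensors)),
        ])
        lstm_ae.compile(optimizer="adam", loss="mse")

        # X_windows: (n_windows, window, n_sensors) windows cut from "good" periods.
        # lstm_ae.fit(X_windows, X_windows, epochs=20, batch_size=64)
        # Windows with a high reconstruction error are treated as anomalous.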

    You might also explore GANs, since they fundamentally depend on a discriminator that learns to separate real data from generated data. Check them out here: https://skymind.ai/wiki/generative-adversarial-network-gan

    There are lots of NN/ML libraries in Python: Keras, TensorFlow, scikit-learn, and PyTorch are all popular (NLTK and spaCy are popular too, but they target text rather than sensor data). scikit-learn in particular ships ready-made outlier detectors such as IsolationForest and LocalOutlierFactor, which fit the detect-then-cluster plan from the question; see the sketch below.
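    A minimal sketch of that plan using scikit-learn only; the contamination rate is taken from the ~300 bad rows out of 500,000 mentioned in the question, and the scaling choice and DBSCAN parameters are assumptions you would tune on your data:

        from sklearn.preprocessing import StandardScaler
        from sklearn.ensemble import IsolationForest
        from sklearn.cluster import DBSCAN

        # X: (500_000, 20) array of sensor readings; impute or drop rows with
        # missing values before this step.
        def detect_and_cluster(X):
            X_scaled = StandardScaler().fit_transform(X)

            # Flag candidate anomalies (contamination ~ 300 / 500_000).
            iso = IsolationForest(contamination=0.0006, random_state=0)
            is_outlier = iso.fit_predict(X_scaled) == -1

            # Group the flagged rows so each cluster can be inspected and labelled
            # by hand; a label of -1 means DBSCAN treated the row as noise.
            cluster_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_scaled[is_outlier])
            return is_outlier, cluster_labels

    LocalOutlierFactor or OneClassSVM can be swapped in for IsolationForest here, since they expose the same fit_predict interface.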