machine-learning, anomaly-detection, bigdata

Detect common features in multidimensional data


I am designing a system for anomaly detection.

There are multiple approaches for building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible shortcomings of such a method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data in order to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.

So, here is the problem:

Consider a large data set with M data points, where each data point may include any number of {key: value} features. I choose to model the training dataset by collecting all the features observed in the data (the set of all unique keys) and using that set as the model's feature space. Each sample is then represented by its values for the keys it contains and None for the features it does not include.
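For concreteness, here is a minimal sketch of that modeling step, assuming the samples are stored as plain Python dicts (the sample contents are made up for illustration):

```python
# Hypothetical training samples, each with an arbitrary subset of features.
raw_samples = [
    {"color": "red", "size": 3},
    {"color": "red", "shape": "circle"},
    {"size": 3},
]

# Feature space: the union of all keys observed in the data.
feature_space = set()
for sample in raw_samples:
    feature_space.update(sample.keys())

# Represent each sample over the full feature space,
# using None for features the sample does not include.
modeled = [
    {key: sample.get(key) for key in feature_space}
    for sample in raw_samples
]
```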

Given this training data set, I want to determine which features recur in the data, and, for those recurring features, whether they mostly share a single value.

My question:

A simple solution would be to count everything: for each of the N features, calculate the distribution of its values. However, as M and N are potentially large, I wonder whether there is a more compact way to represent the data, or a more sophisticated method for making claims about feature frequencies.

Am I reinventing an existing wheel? If there is an online approach for accomplishing such a task, that would be even better.


Solution

  • If I understand your question correctly, you need to go over all the data anyway, so why not use hashing? Specifically, two hash tables:

    1. An outer hash table recording feature existence (each feature key maps to an inner table).
    2. An inner hash table, per feature, holding the distribution of that feature's values.

    This way, the size of the inner hash table indicates how common the feature is in your data, and its contents indicate how the values differ from one another. Another thing to note is that you go over your data only once, and the time complexity of (almost) every operation on a hash table (if you allocate enough space from the beginning) is O(1).
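    A minimal sketch of this idea in Python (the dict-of-Counter layout and the demo samples are assumptions, not part of the answer): the outer dict records which features exist, and each inner Counter holds the distribution of that feature's values.

    ```python
    from collections import Counter, defaultdict

    def build_feature_stats(samples):
        """Outer hash table: feature key -> inner hash table (value -> count)."""
        stats = defaultdict(Counter)
        for sample in samples:
            for key, value in sample.items():
                if value is not None:  # skip features absent from this sample
                    stats[key][value] += 1
        return stats

    # Single pass over the data; each counter update is (amortized) O(1).
    samples = [
        {"color": "red", "size": 3},
        {"color": "red", "shape": "circle"},
        {"color": "blue", "size": 3},
    ]
    stats = build_feature_stats(samples)

    for feature, values in stats.items():
        support = sum(values.values())  # how many samples contain this feature
        most_common_value, count = values.most_common(1)[0]
        print(feature, support, most_common_value, count)
    ```

    Because the counts are updated incrementally in a single pass, the same structure can also be used online: new samples simply keep updating the counters.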

    Hope it helps