python pandas database machine-learning feature-extraction

Efficiently calculate anomaly detection


I have a problem and I hope you can help. Thanks!

I have a table that looks like this:

Computer Data Count
A 01/01/2021 43
A 02/01/2021 64
A 03/01/2021 333
A 04/01/2021 656
B 01/01/2021 41
B 02/01/2021 436
B 03/01/2021 745
B 04/01/2021 234

I would like to run the Isolation Forest algorithm on one part of the table at a time, i.e. separately for each Computer.

I don't want to do it manually, like df[df['Computer'] == 'A']['Count'], for every Computer, because there are about 500 different Computers. So I don't want to do this:

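import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# rows belonging to a single computer
mask = df['Computer'] == 'A'
scaler = StandardScaler()
np_scaled = scaler.fit_transform(df.loc[mask, 'Count'].values.reshape(-1, 1))
data = pd.DataFrame(np_scaled)

# train isolation forest
model = IsolationForest(contamination=0.01)
model.fit(data)
# write the labels back only to this computer's rows
df.loc[mask, 'anomaly'] = model.predict(data)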

500 times (for A, B, C, and so on). Is there a way to do it efficiently? Thanks!

The result should look like this, with anomalies checked for A separately, for B separately, and so on:

Computer Data Count anomaly
A 01/01/2021 43 1
A 02/01/2021 64 1
A 03/01/2021 333 1
A 04/01/2021 656 -1
B 01/01/2021 41 1
B 02/01/2021 436 1
B 03/01/2021 745 1
B 04/01/2021 234 1

Solution

  • You could group by Computer and use transform to run the function you already have on each group. transform returns values aligned with the original index, so the result can be assigned directly to the anomaly column.

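    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.preprocessing import StandardScaler

    def train_isolation_group(group_count):
        # scale this group's counts on their own
        scaler = StandardScaler()
        np_scaled = scaler.fit_transform(group_count.values.reshape(-1, 1))
        data = pd.DataFrame(np_scaled)

        # train isolation forest on this group only
        model = IsolationForest(contamination=0.01)
        model.fit(data)
        # 1 = inlier, -1 = anomaly, one label per row of the group
        return model.predict(data)

    # one model per Computer; results line up with the original index
    df['anomaly'] = df.groupby('Computer')['Count'].transform(train_isolation_group)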
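
To try the grouped approach end to end, here is a minimal self-contained sketch built on the sample table from the question. The random_state=0 argument and the use of fit_predict (instead of fit followed by predict) are my own additions so that repeated runs give the same labels; they are not part of the original answer.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Sample data copied from the question (column names kept as posted).
df = pd.DataFrame({
    "Computer": ["A"] * 4 + ["B"] * 4,
    "Data": ["01/01/2021", "02/01/2021", "03/01/2021", "04/01/2021"] * 2,
    "Count": [43, 64, 333, 656, 41, 436, 745, 234],
})


def train_isolation_group(group_count):
    """Fit a separate IsolationForest on one computer's counts and label each row."""
    scaled = StandardScaler().fit_transform(group_count.to_numpy().reshape(-1, 1))
    # random_state is an assumption added here for reproducible labels.
    model = IsolationForest(contamination=0.01, random_state=0)
    # fit_predict returns 1 for inliers and -1 for anomalies, one value per row.
    return model.fit_predict(scaled)


# transform() calls the function once per group and aligns the result with df's index.
df["anomaly"] = df.groupby("Computer")["Count"].transform(train_isolation_group)
print(df)
```

This still fits one model per Computer (about 500 fits for the real data), but the loop and the index bookkeeping are handled by groupby. With only four rows per computer, as in the sample, the labels are not very meaningful; real groups would need considerably more history.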