python machine-learning scikit-learn data-mining anomaly-detection

n_neighbors parameter of Local Outlier Factor affects ROC-AUC

I am trying to solve an outlier detection problem with several algorithms. When I use the Local Outlier Factor API of scikit-learn, I have to set a very important parameter: n_neighbors. However, different values of n_neighbors give me different ROC_AUC scores. For example, n_neighbors=5 gives ROC_AUC=56, n_neighbors=6 gives ROC_AUC=85, n_neighbors=7 gives ROC_AUC=94, and so on. In general, ROC_AUC is very high when n_neighbors>=6.
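For reference, the experiment described above can be sketched roughly as follows. The data here is synthetic and the helper name `lof_auc_by_k` is hypothetical (my own dataset and labels are not shown), so the printed scores will not match the numbers quoted above. The sketch assumes labels of 1 for outliers and 0 for inliers, and uses the negated `negative_outlier_factor_` so that larger scores mean "more anomalous".

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor


def lof_auc_by_k(X, y_true, k_values):
    """Return {k: ROC AUC} for LOF fitted with each n_neighbors=k.

    Hypothetical helper: y_true is assumed to be 1 for outliers, 0 for inliers.
    """
    results = {}
    for k in k_values:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        # negative_outlier_factor_ is lower for more abnormal points,
        # so negate it to get a score that increases with abnormality.
        scores = -lof.negative_outlier_factor_
        results[k] = roc_auc_score(y_true, scores)
    return results


# Synthetic stand-in data (an assumption, not the real dataset).
rng = np.random.RandomState(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.uniform(-6.0, 6.0, size=(20, 2))
X = np.vstack([inliers, outliers])
y = np.r_[np.zeros(200), np.ones(20)]

aucs = lof_auc_by_k(X, y, [5, 6, 7, 10, 20])
for k, auc in aucs.items():
    print(f"n_neighbors={k}: ROC AUC={auc:.3f}")
```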

I want to ask three questions: (1) Why does the n_neighbors parameter of Local Outlier Factor affect ROC-AUC? (2) How do I choose an appropriate n_neighbors in an unsupervised learning setting? (3) Should I choose a high n_neighbors to get a high ROC_AUC?

Solution

If the parameter did not affect the results, it would not be needed, right?

Considering more neighbors is more costly, but it also means each score is computed from more data, so I'm not surprised that the results improve. Did you read the paper that explains what the parameter does?
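To make the effect of the parameter concrete, here is a minimal sketch of one way a small n_neighbors can hide anomalies. It is an illustrative assumption on synthetic data, not a diagnosis of your dataset: five anomalies sit in a tight clump far from the main cloud. When n_neighbors is smaller than the clump, each anomaly is compared only with its fellow anomalies and looks normal; when n_neighbors exceeds the clump size, the neighborhood reaches into the dense cloud and the LOF values rise. The helper name `mean_clump_lof` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
cloud = rng.normal(0.0, 1.0, size=(200, 2))   # dense inlier cloud around the origin
clump = rng.normal(8.0, 0.05, size=(5, 2))    # 5 anomalies packed tightly together
X = np.vstack([cloud, clump])


def mean_clump_lof(k):
    """Mean LOF value of the 5 clumped anomalies for n_neighbors=k."""
    lof = LocalOutlierFactor(n_neighbors=k).fit(X)
    lof_values = -lof.negative_outlier_factor_  # values near 1 look like inliers
    return lof_values[-5:].mean()


small_k = mean_clump_lof(3)    # neighborhood stays inside the clump
large_k = mean_clump_lof(10)   # neighborhood must include points from the cloud
print(f"mean LOF of clump, n_neighbors=3:  {small_k:.2f}")
print(f"mean LOF of clump, n_neighbors=10: {large_k:.2f}")
```

A sharp jump in ROC AUC between two adjacent values of n_neighbors is at least consistent with this kind of effect, but other explanations are possible.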

If you choose the parameter based on the evaluation, you are cheating. It is an unsupervised method: in a real use case you are not supposed to have such labels.