python data-science data-analysis outliers

Outliers in Categorical Data?


I cannot find a way to detect outliers in categorical data. Each row of my data is a combination of categorical values, and I want to mark as outliers the rows that differ in certain combinations.

As described above, I cannot separate the rows by clustering, because a non-outlier row and an outlier row can have the same frequency.

My data looks something like this:

      c1  c2  c3  c4
row1  A   B   C   D
row2  A   B   C   D
row3  A   D   C   G
row4  NU  D   E   G
row6  NU  D   E   X

Please suggest a valid logic to solve this. I also tried splitting the data by frequency, but I cannot find a threshold value below which a row should count as an outlier. A way to choose that threshold would also help.


Solution

  • There is no standard outlier detection for purely categorical data; the notion has no clear meaning in this case. Consider this example:

    You have a sample of 10 people: 9 females and 1 male. You might think the male is the outlier, but that is just the composition of your sample, not an outlier.

    For an outlier to exist there must be a measure of distance between the items. Have a look at this for more information.

    "Please suggest a valid logic to solve the issue. I also tried to distribute the data based on frequency, but I'm unable to assign a threshold as I'm unable to find a value to consider the data as outliers. Providing a way to find a threshold can also help."

    One option is to call value_counts on your column, which gives you the frequency of each element.
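
    The value_counts suggestion above can be sketched with pandas as follows. This is only an illustration on the sample rows from the question. The "|" separator used to build a per-row key and the min_share cutoff are assumptions made for this example, not values from the question or the answer.

```python
import pandas as pd

# Sample rows copied from the question (row labels as given there).
df = pd.DataFrame(
    {
        "c1": ["A", "A", "A", "NU", "NU"],
        "c2": ["B", "B", "D", "D", "D"],
        "c3": ["C", "C", "C", "E", "E"],
        "c4": ["D", "D", "G", "G", "X"],
    },
    index=["row1", "row2", "row3", "row4", "row6"],
)

# Frequency of each value within a single column.
c4_counts = df["c4"].value_counts()

# Build one string key per row so whole combinations can be counted.
# Assumption: "|" never appears inside a category value.
combo_key = df.astype(str).agg("|".join, axis=1)

# Number of rows that share each row's exact combination.
row_count = combo_key.map(combo_key.value_counts())

# Share of the sample covered by each row's combination.
row_share = row_count / len(df)

# Hypothetical cutoff: the data does not dictate this value,
# so it has to come from domain knowledge.
min_share = 0.25
rare_rows = df[row_share < min_share]

print(c4_counts)
print(rare_rows)
```

    With only five rows, every combination that occurs once falls below the cutoff, which matches the point made above: the counts describe the composition of the sample, and whether a rare combination counts as an outlier remains a judgment call.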