python data-science data-analysis outliers

Outliers in Categorical Data?

I can't find a way to detect outliers in categorical data. Each row of my data is a combination of category values, and I want to mark rows as outliers when they differ in certain combinations.

For this problem I cannot simply cluster the data, because a non-outlier row and an outlier row can have the same frequency.

My data looks something like this:

	c1	c2	c3	c4
row1	A	B	C	D
row2	A	B	C	D
row3	A	D	C	G
row4	NU	D	E	G
row6	NU	D	E	X

Please suggest a valid approach to solve this. I also tried grouping the data by frequency, but I can't set a threshold because I don't know which frequency value should mark a row as an outlier. A way to find such a threshold would also help.

Solution

There are no outlier detection methods for categorical data; the notion has no meaning in this case. Consider the following example:

You have a sample of 10 people with 9 females and 1 male. You might think the male is the outlier, but that is just the composition of your sample, not an outlier.

For an outlier to exist there must be a measure of distance between the items. Have a look at this for more information.
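
To make the point about distance concrete, here is a minimal sketch of one possible distance for categorical rows: the Hamming distance, i.e. the number of columns in which two rows differ. This is an assumption for illustration only, not something the answer above proposes, and the helper name hamming_distance is hypothetical. The rows are taken from the sample data in the question.

```python
def hamming_distance(row_a, row_b):
    """Count the positions at which two equal-length rows differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of columns")
    return sum(a != b for a, b in zip(row_a, row_b))


# Rows copied from the question's sample table (columns c1..c4).
row1 = ["A", "B", "C", "D"]
row3 = ["A", "D", "C", "G"]
row6 = ["NU", "D", "E", "X"]

d13 = hamming_distance(row1, row3)  # row1 and row3 differ in c2 and c4
d16 = hamming_distance(row1, row6)  # row1 and row6 differ in every column
```

Whether such a distance is meaningful depends on the data: it treats every column and every mismatch as equally important, which may not hold in your case.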

Regarding this part of the question: "Please suggest a valid logic to solve the issue. I also tried to distribute the data based on frequency but I'm unable to assign a threshold as I'm unable to find a value to consider the data as outliers. Providing a way to find thresholds also can help."

A solution could be to call value_counts on your column, which gives you the frequency of each element.
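
A minimal sketch of that idea with pandas, using the sample data from the question. It counts values within one column and also counts how often each full row combination occurs. The threshold min_count is a hypothetical parameter chosen purely for illustration; nothing in the data tells you what its value should be, which is exactly the difficulty described in the question.

```python
import pandas as pd

# Sample data from the question (row labels kept as the index).
df = pd.DataFrame(
    {
        "c1": ["A", "A", "A", "NU", "NU"],
        "c2": ["B", "B", "D", "D", "D"],
        "c3": ["C", "C", "C", "E", "E"],
        "c4": ["D", "D", "G", "G", "X"],
    },
    index=["row1", "row2", "row3", "row4", "row6"],
)

# Frequency of each value within a single column.
c4_counts = df["c4"].value_counts()

# Frequency of each full row combination, aligned back to every row.
combo_counts = df.groupby(list(df.columns))["c1"].transform("size")

# Assumption: min_count is an arbitrary cutoff, not a derived value.
min_count = 2
rare_rows = df.index[combo_counts < min_count].tolist()
```

With this cutoff, every row whose combination appears only once is flagged, so row3, row4 and row6 are all treated the same. Frequency alone does not separate them.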