I have a dataframe, in which here is a column that contains different colors. My purpose is to count how many different colors are in this column. Note: "Red, 1", "1 Red", "Red 2" are all Red, so all above count as only one color. however, "Dark Red" is a different color
Item Color
Flower Red 1
Flower Yellow 1
Flower Red, 1
Flower Red 2
Flower 2 red
Flower Green, 1
Flower Dark Red 1
Flower Green, 2
Flower Black
Flower White 1
Flower 1A, Green
Because in this column, the structures of the color names are not identical. Hence I am not able to just use sub() to delete everything after the first word and count the number. I have tried to use sapply(apply()) or grep() for approximate matching, but the result is not that ideal. I also tried to distinct() to remain all the unique color name but it fails in dealing with strings like "Red, 1" and "Red, 2" For me, the most difficult situation is the color follows a comma, such as "1A, Blue" or something else
I hope the result could be
Flower Red
Flower Yellow
Flower Green
Flower Dark Red
Flower Black
Flower White
Or, simpler
6
Say the dataframe is called df
then you can use:
length(unique(toupper(gsub("[^a-zA-Z]", "", df$Color))))
Explanation:
gsub("[^a-zA-Z]", "", df$Color)
removes anything that is not a character. So you are left with just the colors.
toupper
capitalizes the remaining text, and unique removes duplicates. length
gives you the number of elements in the vector. You will get 5
.
Addition: To handle the case of "Red, 1A" you can remove everything after the ',' and then use the above solution. For example:
x <- gsub(",.*$", "", df$Color) ## removes a comma and anything that follows
length(unique(toupper(gsub("[^a-zA-Z]", "", x)))) ## removes all but letters
Alternative to removing everything after a comma: You could remove any alpha numeric such as "1A" or "A1" with:
x <- gsub("[a-zA-Z]*[0-9][a-zA-Z]*", "", df$Color)