Search code examples
rstringapplylapplysapply

R function for count number while deleting the duplicates


I have a dataframe, in which here is a column that contains different colors. My purpose is to count how many different colors are in this column. Note: "Red, 1", "1 Red", "Red 2" are all Red, so all above count as only one color. however, "Dark Red" is a different color

Item               Color
Flower             Red 1
Flower             Yellow 1
Flower             Red, 1
Flower             Red 2
Flower             2 red
Flower             Green, 1
Flower             Dark Red 1
Flower             Green, 2
Flower             Black
Flower             White 1
Flower             1A, Green

Because in this column, the structures of the color names are not identical. Hence I am not able to just use sub() to delete everything after the first word and count the number. I have tried to use sapply(apply()) or grep() for approximate matching, but the result is not that ideal. I also tried to distinct() to remain all the unique color name but it fails in dealing with strings like "Red, 1" and "Red, 2" For me, the most difficult situation is the color follows a comma, such as "1A, Blue" or something else

I hope the result could be

Flower    Red
Flower    Yellow
Flower    Green
Flower    Dark Red
Flower    Black
Flower    White

Or, simpler

6

Solution

  • Say the dataframe is called df then you can use:

    length(unique(toupper(gsub("[^a-zA-Z]", "", df$Color))))

    Explanation:

    gsub("[^a-zA-Z]", "", df$Color) removes anything that is not a character. So you are left with just the colors.

    touppercapitalizes the remaining text, and unique removes duplicates. length gives you the number of elements in the vector. You will get 5.

    Addition: To handle the case of "Red, 1A" you can remove everything after the ',' and then use the above solution. For example:

    x <- gsub(",.*$", "", df$Color) ## removes a comma and anything that follows

    length(unique(toupper(gsub("[^a-zA-Z]", "", x)))) ## removes all but letters

    Alternative to removing everything after a comma: You could remove any alpha numeric such as "1A" or "A1" with:

    x <- gsub("[a-zA-Z]*[0-9][a-zA-Z]*", "", df$Color)