Search code examples
rdata-cleaningdata-wrangling

How to drop rows in R with almost the same column value?


I have a dataset with a column of names. I would like to drop rows with the lesser "P" value, if there exists one with a higher value. For example, in the dataset below, I would like to drop the row ID's 3 and 5 since there exists a 'Texas P5' and a 'North Dakota P9.' What is the best way to do this? Thanks in advance!

ID Name Score
1 Minnesota P2 342
2 Vermont P7 342
3 Texas P4 65
4 New Mexico 643
5 North Dakota P8 78
6 North Dakota P9 245
7 Texas P5 856
8 Minnesota LP 342

Solution

  • Here is a base R way. Use ave to split the data by Name excluding the numbers, and check which group element is equal to its greatest element. ave returns a vector of the same class as its input, in this case character. So coerce to logical and subset the original data frame.

    x<-"
    ID  Name    Score
    1   'Minnesota P2'  342
    2   'Vermont P7'    342
    3   'Texas P4'  65
    4   'New Mexico'    643
    5   'North Dakota P8'   78
    6   'North Dakota P9'   245
    7   'Texas P5'  856
    8   'Minnesota LP'  342"
    df1 <- read.table(textConnection(x), header = TRUE)
    
    
    i <- with(df1, ave(Name, sub("\\d+", "", Name), FUN = \(x){
      x == tail(sort(x), 1)
    }))
    df1[as.logical(i),]
    #>   ID            Name Score
    #> 1  1    Minnesota P2   342
    #> 2  2      Vermont P7   342
    #> 4  4      New Mexico   643
    #> 6  6 North Dakota P9   245
    #> 7  7        Texas P5   856
    #> 8  8    Minnesota LP   342
    

    Created on 2022-07-06 by the reprex package (v2.0.1)