Search code examples
rconditional-statementsdelete-row

Deleting rows in R conditionally


I have a data where the first column is a bunch of ID numbers (some repeat), and the second column is just a bunch of numbers. I need a way to keep each ID number only once based on the smallest number in the second column.

Row#   ID   Number
1      10     180
2      12     167
3      12     182
4      12     135
5      15     152
6      15     133

Ex: I only want to keep Row# 1, 4, and 6 here and delete the rest


Solution

  • For selecting the row that has the minimum 'Number' for each 'ID' group, we can use one of the aggregating by group function. A base R option is aggregate. With aggregate, we can either use the 'formula' method or specify a list of grouping elements/variables with the by argument. Using the formula method, we get the min value of 'Number' for each 'ID'.

    aggregate(Number~ID, df1, FUN=min)
    

    Or we can use a faster option with data.table. Here, we convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', we get the min value of "Number".

    library(data.table)
    setDT(df1)[, list(Number=min(Number)), by = ID] 
    

    Or this can be also done with setorder to order the 'Number' column and use unique with by option to select the first non-duplicated 'ID' row. (from @David Arenburgs' comments)

     unique(setorder(setDT(df1), Number), by = "ID")
    

    Or using dplyr, we group by 'ID' and get the subset rows with summarise.

    library(dplyr)
    df1 %>%
       group_by(ID) %>%
       summarise(Number= min(Number))
    

    Or we can use sqldf syntax to get the subset of data.

    library(sqldf)
    sqldf('select ID,
            min(Number) as Number
            from df1 
            group by ID')
    

    Update

    If there are multiple columns and you want to get the row based on the minimum value of 'Number' for each 'ID', you can use which.min. Using .I will get the row index and that can be used for subsetting the rows.

    setDT(df1)[df1[,  .I[which.min(Number)], by = ID]$V1]
    

    Or with dplyr we use slice to filter out the rows that have the min value of 'Number' for each 'ID'

    df1 %>% 
        group_by(ID) %>%
        slice(which.min(Number))