Search code examples
rsubset

How to remove rows by condition in R?


I'm new to R and I was wondering if there is an opposite code of 'which' in R?

So e.g. when I run the code below, then it keeps all the data between 10 and 50 and removes everything else. The code below works for me, there is no problem there.

data <- data[which(data$age>10 & data$age<50),]

But I want to know if there is a code that can do the opposite? Meaning --> I want to remove specific rows from the data, so instead of having a code that says what to keep I want a code that indicates what to remove. If that makes sense? I want to remove a specific row by condition.

I have tried with the subset code, but I can't get it to work. The below code is the code I tried that didn't work

data2 <- subset(data1, data1$gender=='male')

So gender is a column, with females and males. And I want a code to remove the males only.


Solution

  • Although this should be a comment, since you said you are new to R let me take a bit more space to explain this in a non-technical way as it is clear there is some confusion.

    First, when you index in R using brackets (ie, df[x,y]), the x part (before the comma) looks at rows, and the y part looks at columns. Your question title asks about removing columns, but your question asks about removing rows. So I will go through both.

    Say you have these data (note some have missing (NA) values):

    set.seed(123)
    n <- 10
    df <- data.frame(Age = sample(c(1:100, NA), n, replace = TRUE),
                     Gender = sample(c("Male", "Female", NA), n, replace = TRUE),
                     Cofactor = rep(LETTERS, length.out = n),
                     Cofactor2 = sample(c("Yes", "No", "Maybe", NA), n, replace = TRUE),
                     Cofactor3 = runif(n))
    
    #    Age Gender Cofactor Cofactor2  Cofactor3
    # 1   31 Female        A       Yes 0.02461368
    # 2   79   Male        B     Maybe 0.47779597
    # 3   51 Female        C      <NA> 0.75845954
    # 4   14   <NA>        D        No 0.21640794
    # 5   67   Male        E     Maybe 0.31818101
    # 6   42   <NA>        F        No 0.23162579
    # 7   50   <NA>        G       Yes 0.14280002
    # 8   43   Male        H        No 0.41454634
    # 9   NA   Male        I     Maybe 0.41372433
    # 10  14   Male        J      <NA> 0.36884545
    

    Dropping Rows

    You can index rows by row position using numbers - i.e. if you want to keep or drop the first three rows:

    # keep 
    df[1:3, ]
    
    # drop
    df[-c(1:3),]
    

    Notice commands are in the x indexing position (left of the comma). If you wanted to drop the observations (rows) that were male, you could do it several ways. For instance:

    df[!(df$Gender %in% "Male"),]
    
    # or using `which()`
    df[-(which(df$Gender %in% "Male")),]
    
    #   Age Gender Cofactor1  Cofactor2
    # 1  31 Female       Yes 0.02461368
    # 3  51 Female      <NA> 0.75845954
    # 4  14   <NA>        No 0.21640794
    # 6  42   <NA>        No 0.23162579
    # 7  50   <NA>       Yes 0.14280002
    

    The ! means "not" - so this reads, "select rows that are not male" - including NA values.

    If you did this:

    df[df$Gender %in% "Female",]
    
    # or 
    df[which(df$Gender %in% "Female"),]
    
    #   Age Gender Cofactor1  Cofactor2
    # 1  31 Female       Yes 0.02461368
    # 3  51 Female      <NA> 0.75845954
    

    That would read "include all where gender is female" - notice NA != female so they are not included.

    Similarly, if you wanted to include both "yes" and "maybe" in Cofactor1:

    df[df$Cofactor1 %in% c("Yes", "Maybe"),]
    
    #   Age Gender Cofactor1  Cofactor2
    # 1  31 Female       Yes 0.02461368
    # 2  79   Male     Maybe 0.47779597
    # 5  67   Male     Maybe 0.31818101
    # 7  50   <NA>       Yes 0.14280002
    # 9  NA   Male     Maybe 0.41372433
    

    Note that I am using %in%, not ==, this is because of vector recycling - see what happens when I use == (hint, it gives unwanted results):

    df[df$Cofactor1 == c("Yes", "Maybe"),]
    
    #     Age Gender Cofactor1  Cofactor2
    #1     31 Female       Yes 0.02461368
    #2     79   Male     Maybe 0.47779597
    #NA    NA   <NA>      <NA>         NA
    #7     50   <NA>       Yes 0.14280002
    #NA.1  NA   <NA>      <NA>         NA
    

    The correct way to use == is much more verbose (df[(df$Cofactor1 == "Yes"| df$Cofactor1 == "Maybe") & !is.na(df$Cofactor1),] so using %in% is a good option here.

    Keeping/Dropping Columns

    Indexing columns is on the y position of indexing (to the right of the comma). If your data have a large number of unneeded columns, you can simply choose the ones you want to keep by indexing by name (or column number:

    df[,c("Age", "Gender")]
    
    # or df[, 1:2]
    
    #    Age Gender
    # 1   31 Female
    # 2   79   Male
    # 3   51 Female
    # 4   14   <NA>
    # 5   67   Male
    # 6   42   <NA>
    # 7   50   <NA>
    # 8   43   Male
    # 9   NA   Male
    # 10  14   Male
    

    But you can only drop columns by number (I know, quirky) - so you cant drop by df[,-c("Age", "Gender")] but you can drop by df[,-c(1:2)]

    In my work it is preferred to drop by name since columns get shifted around a bit - so with names I know exactly what I am dropping. One workaround I use is to use grep with names(df) to identify the positions of the columns that I want to drop.

    This is a little tricky so be careful. If I want to drop all columns that start with "Cofactor" in the name:

    dropcols <- grep("Cofactor", names(df))
    
    # or to ignore case
    # grep("Cofactor", names(df), ignore.case = TRUE)
    # [1] 3 4 5
    

    If I only wanted to drop Cofactor but keep Cofactor1 and Cofactor2, I could use \\b to put a word boundary on it:

    dropcols <- grep("\\bCofactor\\b", names(df))
    [1] 3
    

    So to drop the columns, you can simply index like so:

    dropcols <- grep("Cofactor", names(df))
    df[, -dropcols]
    
    #    Age Gender
    # 1   31 Female
    # 2   79   Male
    # 3   51 Female
    # 4   14   <NA>
    # 5   67   Male
    # 6   42   <NA>
    # 7   50   <NA>
    # 8   43   Male
    # 9   NA   Male
    # 10  14   Male