I'm new to R and I was wondering if there is an opposite of which() in R?
So, e.g., when I run the code below, it keeps all the rows where age is between 10 and 50 and removes everything else. This code works for me; there is no problem there.
data <- data[which(data$age>10 & data$age<50),]
But I want to know if there is code that can do the opposite? Meaning: instead of code that says what to keep, I want code that indicates what to remove, if that makes sense. I want to remove specific rows by condition.
I have tried with subset(), but I can't get it to work. This is the code I tried that didn't work:
data2 <- subset(data1, data1$gender=='male')
So gender is a column with females and males, and I want code that removes only the males.
Although this should probably be a comment, since you said you are new to R, let me take a bit more space to explain this in a non-technical way, as it is clear there is some confusion.
First, when you index in R using brackets (i.e., df[x, y]), the x part (before the comma) refers to rows, and the y part (after the comma) refers to columns. Your question title asks about removing columns, but your question body asks about removing rows, so I will go through both.
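For example (a tiny made-up data frame, just to show where the row and column parts go):
toy <- data.frame(a = 1:3, b = c("x", "y", "z"))
toy[2, ]     # x position: the second row, all columns
toy[, "b"]   # y position: all rows, just column "b"
toy[2, "b"]  # both: a single cell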
Say you have these data (note that some have missing (NA) values):
set.seed(123)
n <- 10
df <- data.frame(Age = sample(c(1:100, NA), n, replace = TRUE),
                 Gender = sample(c("Male", "Female", NA), n, replace = TRUE),
                 Cofactor = rep(LETTERS, length.out = n),
                 Cofactor2 = sample(c("Yes", "No", "Maybe", NA), n, replace = TRUE),
                 Cofactor3 = runif(n))
#    Age Gender Cofactor Cofactor2  Cofactor3
# 1   31 Female        A       Yes 0.02461368
# 2   79   Male        B     Maybe 0.47779597
# 3   51 Female        C      <NA> 0.75845954
# 4   14   <NA>        D        No 0.21640794
# 5   67   Male        E     Maybe 0.31818101
# 6   42   <NA>        F        No 0.23162579
# 7   50   <NA>        G       Yes 0.14280002
# 8   43   Male        H        No 0.41454634
# 9   NA   Male        I     Maybe 0.41372433
# 10  14   Male        J      <NA> 0.36884545
You can index rows by row position using numbers - i.e. if you want to keep or drop the first three rows:
# keep
df[1:3, ]
# drop
df[-c(1:3),]
Notice these commands go in the x indexing position (left of the comma). If you wanted to drop the observations (rows) that are male, you could do it several ways. For instance:
df[!(df$Gender %in% "Male"),]
# or using `which()`
df[-(which(df$Gender %in% "Male")),]
#   Age Gender Cofactor Cofactor2  Cofactor3
# 1  31 Female        A       Yes 0.02461368
# 3  51 Female        C      <NA> 0.75845954
# 4  14   <NA>        D        No 0.21640794
# 6  42   <NA>        F        No 0.23162579
# 7  50   <NA>        G       Yes 0.14280002
The ! means "not" - so this reads "select rows that are not male", which includes the NA values.
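As a side note on the subset() attempt in the question: subset() keeps the rows where the condition is TRUE, so to remove the males you negate the condition there as well. A quick sketch, assuming the same Gender/"Male" coding as the example data:
# negate the condition so subset() keeps everything that is NOT male
no_males <- subset(df, !(Gender %in% "Male"))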
If you did this:
df[df$Gender %in% "Female",]
# or
df[which(df$Gender %in% "Female"),]
#   Age Gender Cofactor Cofactor2  Cofactor3
# 1  31 Female        A       Yes 0.02461368
# 3  51 Female        C      <NA> 0.75845954
That would read "include all rows where gender is female" - notice that NA is not equal to "Female", so those rows are not included.
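Related to that, == handles NA differently from %in% - a quick sketch with the same df:
# NA == "Female" evaluates to NA, and an NA in a logical row index
# produces an all-NA row in the result
df[df$Gender == "Female", ]   # rows 1 and 3, plus three all-NA rows
# NA %in% "Female" is simply FALSE, so the NA rows are dropped cleanly
df[df$Gender %in% "Female", ] # rows 1 and 3 only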
Similarly, if you wanted to include both "Yes" and "Maybe" in Cofactor2:
df[df$Cofactor2 %in% c("Yes", "Maybe"),]
#   Age Gender Cofactor Cofactor2  Cofactor3
# 1  31 Female        A       Yes 0.02461368
# 2  79   Male        B     Maybe 0.47779597
# 5  67   Male        E     Maybe 0.31818101
# 7  50   <NA>        G       Yes 0.14280002
# 9  NA   Male        I     Maybe 0.41372433
Note that I am using %in%, not ==. This is because of vector recycling - see what happens when I use == (hint: it gives unwanted results):
df[df$Cofactor2 == c("Yes", "Maybe"),]
#      Age Gender Cofactor Cofactor2  Cofactor3
# 1     31 Female        A       Yes 0.02461368
# 2     79   Male        B     Maybe 0.47779597
# NA    NA   <NA>     <NA>      <NA>         NA
# 7     50   <NA>        G       Yes 0.14280002
# NA.1  NA   <NA>     <NA>      <NA>         NA
The correct way to use == is much more verbose (df[(df$Cofactor2 == "Yes" | df$Cofactor2 == "Maybe") & !is.na(df$Cofactor2),]), so using %in% is a good option here.
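For reference, here is that verbose version written out as runnable code (same example df, and the same result as the %in% call above):
# spelled out with ==; the !is.na() part is needed because
# NA == "Yes" evaluates to NA rather than FALSE
df[(df$Cofactor2 == "Yes" | df$Cofactor2 == "Maybe") & !is.na(df$Cofactor2), ]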
Indexing columns happens in the y position (to the right of the comma). If your data have a large number of unneeded columns, you can simply choose the ones you want to keep by indexing by name (or by column number):
df[,c("Age", "Gender")]
# or df[, 1:2]
#    Age Gender
# 1   31 Female
# 2   79   Male
# 3   51 Female
# 4   14   <NA>
# 5   67   Male
# 6   42   <NA>
# 7   50   <NA>
# 8   43   Male
# 9   NA   Male
# 10  14   Male
But you can only drop columns by number (I know, quirky) - so you can't drop with df[,-c("Age", "Gender")], but you can drop with df[,-c(1:2)].
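One way to still drop by name in base R - a quick sketch using the example df - is to negate a match on names(df):
# names(df) %in% c("Age", "Gender") is TRUE for the columns to drop,
# and the ! flips it so everything else is kept
df[, !(names(df) %in% c("Age", "Gender"))]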
In my work it is preferred to drop by name, since columns get shifted around a bit - with names I know exactly what I am dropping. Another workaround I use is grep with names(df) to identify the positions of the columns that I want to drop.
This is a little tricky, so be careful. If I want to drop all columns that have "Cofactor" in the name:
dropcols <- grep("Cofactor", names(df))
# or to ignore case
# grep("Cofactor", names(df), ignore.case = TRUE)
# [1] 3 4 5
If I only wanted to drop Cofactor but keep Cofactor2 and Cofactor3, I could use \\b to put a word boundary on it:
dropcols <- grep("\\bCofactor\\b", names(df))
# [1] 3
So to drop the columns, you can simply index like so:
dropcols <- grep("Cofactor", names(df))
df[, -dropcols]
#    Age Gender
# 1   31 Female
# 2   79   Male
# 3   51 Female
# 4   14   <NA>
# 5   67   Male
# 6   42   <NA>
# 7   50   <NA>
# 8   43   Male
# 9   NA   Male
# 10  14   Male
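Putting the two together - rows in the x position, columns in the y position - you can filter rows and drop columns in a single bracket call. A quick sketch with the same example df:
# keep the non-male rows and drop every column whose name contains "Cofactor"
df[!(df$Gender %in% "Male"), -grep("Cofactor", names(df))]
#   Age Gender
# 1  31 Female
# 3  51 Female
# 4  14   <NA>
# 6  42   <NA>
# 7  50   <NA>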