I'm analyzing a big dataset in RStudio and I am not very experienced in programming. I would like to remove the rows where the chromosome labels in columns CONSENSUSMAP and SVEVOMAP do not agree, and also the rows where CONSENSUSMAP is missing (NA).
Here is a small table as an example:
CLONEID | CONSENSUSMAP | SVEVOMAP
1228104 | NA | chr1A
2277691 | NA | chr1A
2277607 | 1A | chr1A
1E+08 | NA | chr1A
1229677 | 1B | chr1A
1126457 | 7B | chr7B
I would like to obtain the following output:
CLONEID | CONSENSUSMAP | SVEVOMAP
2277607 | 1A | chr1A
1126457 | 7B | chr7B
I tried several approaches, but none of them handles these specific conditions. Any suggestions?
The following dplyr solution will do what the question asks for.
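library(dplyr)

df1 %>%
  # drop rows where CONSENSUSMAP is missing
  filter(!is.na(CONSENSUSMAP)) %>%
  # strip everything before the first digit, e.g. "chr1A" becomes "1A"
  mutate(newcol = sub("^[^[:digit:]]*(\\d+.*$)", "\\1", SVEVOMAP)) %>%
  # keep only rows where both columns name the same chromosome
  filter(CONSENSUSMAP == newcol) %>%
  # remove the helper column
  select(-newcol)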
# CLONEID CONSENSUSMAP SVEVOMAP
#1 2277607 1A chr1A
#2 1126457 7B chr7B
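If dplyr is not available, the same idea can be sketched in base R. This is only a minimal sketch under the same assumption as above, namely that SVEVOMAP is a non-digit prefix followed by the chromosome label; the names `svevo_id`, `keep` and `result` are made up for the illustration.

```r
# Example data, same values as in the Data section
df1 <- data.frame(
  CLONEID = c("1228104", "2277691", "2277607", "1e+08", "1229677", "1126457"),
  CONSENSUSMAP = c(NA, NA, "1A", NA, "1B", "7B"),
  SVEVOMAP = c("chr1A", "chr1A", "chr1A", "chr1A", "chr1A", "chr7B"),
  stringsAsFactors = FALSE
)

# Strip the leading non-digit prefix, e.g. "chr1A" becomes "1A"
svevo_id <- sub("^[^[:digit:]]*", "", df1$SVEVOMAP)

# TRUE only where CONSENSUSMAP is present and equal to the stripped label;
# FALSE & NA evaluates to FALSE, so rows with NA are dropped
keep <- !is.na(df1$CONSENSUSMAP) & df1$CONSENSUSMAP == svevo_id

result <- df1[keep, ]
result
```

Building the logical vector `keep` first makes it easy to check how many rows would be dropped, with `table(keep)`, before subsetting the real data.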
Edit.
Here are two other ways, both with dplyr; the second one also uses the stringr package.
df1 %>%
  filter(!is.na(CONSENSUSMAP)) %>%
  # grepl() takes a single pattern, so evaluate it one row at a time
  rowwise() %>%
  filter(grepl(CONSENSUSMAP, SVEVOMAP))
#Source: local data frame [2 x 3]
#Groups: <by row>
#
## A tibble: 2 x 3
# CLONEID CONSENSUSMAP SVEVOMAP
# <chr> <chr> <chr>
#1 2277607 1A chr1A
#2 1126457 7B chr7B
df1 %>%
  filter(!is.na(CONSENSUSMAP)) %>%
  # str_detect() is vectorised over string and pattern, so rowwise() is not needed
  filter(stringr::str_detect(SVEVOMAP, CONSENSUSMAP))
# CLONEID CONSENSUSMAP SVEVOMAP
#1 2277607 1A chr1A
#2 1126457 7B chr7B
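Note that `grepl()` and `str_detect()` treat CONSENSUSMAP as a regular expression and match it anywhere inside SVEVOMAP. If a literal match at the end of the string is preferred, a possible sketch with base R's `endsWith()` (available since R 3.3.0) is shown below. It assumes the chromosome label is always the last part of SVEVOMAP, which holds for the example data but may not hold for the full dataset; `keep` and `result` are illustrative names.

```r
# Example data, same values as in the Data section
df1 <- data.frame(
  CLONEID = c("1228104", "2277691", "2277607", "1e+08", "1229677", "1126457"),
  CONSENSUSMAP = c(NA, NA, "1A", NA, "1B", "7B"),
  SVEVOMAP = c("chr1A", "chr1A", "chr1A", "chr1A", "chr1A", "chr7B"),
  stringsAsFactors = FALSE
)

# endsWith() compares literally (no regex) and element by element;
# it returns NA where CONSENSUSMAP is NA, and FALSE & NA is FALSE,
# so those rows are dropped by the first condition
keep <- !is.na(df1$CONSENSUSMAP) & endsWith(df1$SVEVOMAP, df1$CONSENSUSMAP)

result <- df1[keep, ]
result
```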
Data.
df1 <-
structure(list(CLONEID = c("1228104", "2277691", "2277607", "1e+08",
"1229677", "1126457"), CONSENSUSMAP = c(NA, NA, "1A", NA, "1B",
"7B"), SVEVOMAP = c("chr1A", "chr1A", "chr1A", "chr1A", "chr1A",
"chr7B")), row.names = c(NA, -6L), class = "data.frame")