
Remove rows that contain different letters or missing data for two columns


I'm analyzing a large dataset in RStudio and I am not very experienced in programming. I want to remove the rows in which columns CONSENSUSMAP and SVEVOMAP do not agree (for example, 1B versus chr1A), as well as the rows where CONSENSUSMAP is missing.

Here is an example table:

CLONEID | CONSENSUSMAP | SVEVOMAP
1228104 | NA           | chr1A
2277691 | NA           | chr1A
2277607 | 1A           | chr1A
1E+08   | NA           | chr1A
1229677 | 1B           | chr1A
1126457 | 7B           | chr7B

I would like to obtain the following output:

CLONEID | CONSENSUSMAP | SVEVOMAP
2277607 | 1A           | chr1A
1126457 | 7B           | chr7B

I tried several approaches, but none of them handles these specific conditions. Any suggestions?


Solution

  • The following dplyr solution will do what the question asks for.

    library(dplyr)
    
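    # Drop rows with missing CONSENSUSMAP, then compare it with SVEVOMAP
    # after stripping SVEVOMAP's leading non-digit prefix (e.g. "chr").
    df1 %>%
      filter(!is.na(CONSENSUSMAP)) %>%
      mutate(newcol = sub("^[^[:digit:]]*(\\d+.*$)", "\\1", SVEVOMAP)) %>%
      filter(CONSENSUSMAP == newcol) %>%
      select(-newcol)  # remove the temporary helper column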
    #  CLONEID CONSENSUSMAP SVEVOMAP
    #1 2277607           1A    chr1A
    #2 1126457           7B    chr7B
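
To see what the `sub()` call in the pipeline above contributes, here is a minimal base R sketch that applies the same regular expression to a couple of sample values. The vector name `svevo` is only illustrative and is not part of the original answer.

```r
# Sample SVEVOMAP values taken from the question's table.
svevo <- c("chr1A", "chr7B")

# "^[^[:digit:]]*" consumes the leading non-digit characters ("chr");
# "(\\d+.*$)" captures everything from the first digit to the end,
# and "\\1" replaces the whole string with that captured part.
stripped <- sub("^[^[:digit:]]*(\\d+.*$)", "\\1", svevo)

print(stripped)
# [1] "1A" "7B"
```

The stripped values have the same form as CONSENSUSMAP, which is why the pipeline can compare the two columns with `==`.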
    

    Edit.

    Here are two other ways, both using dplyr; the second one also uses the stringr package.

    # grepl() takes a single pattern, so rowwise() applies it one row at a time.
    df1 %>%
      filter(!is.na(CONSENSUSMAP)) %>%
      rowwise() %>%
      filter(grepl(CONSENSUSMAP, SVEVOMAP))
    #Source: local data frame [2 x 3]
    #Groups: <by row>
    #
    ## A tibble: 2 x 3
    #  CLONEID CONSENSUSMAP SVEVOMAP
    #  <chr>   <chr>        <chr>   
    #1 2277607 1A           chr1A   
    #2 1126457 7B           chr7B   
    
    
    # str_detect() is vectorised over both string and pattern, so no rowwise() is needed.
    df1 %>%
      filter(!is.na(CONSENSUSMAP)) %>%
      filter(stringr::str_detect(SVEVOMAP, CONSENSUSMAP))
    #  CLONEID CONSENSUSMAP SVEVOMAP
    #1 2277607           1A    chr1A
    #2 1126457           7B    chr7B
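
One thing to keep in mind with the `grepl()` and `str_detect()` variants is that they treat CONSENSUSMAP as a regular-expression pattern and accept a match anywhere in SVEVOMAP, so "1A" would also match a hypothetical value such as "chr11A". If exact agreement is wanted, one option is to compare whole labels. The following is a base R sketch, assuming every SVEVOMAP value is simply the CONSENSUSMAP value prefixed with "chr"; that assumption comes from the example table and is not stated in the question. The names `keep` and `result` are illustrative.

```r
# Example data, copied from the question's table.
df1 <- data.frame(
  CLONEID      = c("1228104", "2277691", "2277607", "1e+08", "1229677", "1126457"),
  CONSENSUSMAP = c(NA, NA, "1A", NA, "1B", "7B"),
  SVEVOMAP     = c("chr1A", "chr1A", "chr1A", "chr1A", "chr1A", "chr7B"),
  stringsAsFactors = FALSE
)

# Assumption: SVEVOMAP is always "chr" followed by the chromosome label.
# Keep a row only if CONSENSUSMAP is present and the full labels are equal.
keep <- !is.na(df1$CONSENSUSMAP) &
  df1$SVEVOMAP == paste0("chr", df1$CONSENSUSMAP)

result <- df1[keep, ]
print(result)
```

On the example data this keeps the rows with CLONEID 2277607 and 1126457, the same two rows as the approaches above.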
    

    Data.

    df1 <-
    structure(list(CLONEID = c("1228104", "2277691", "2277607", "1e+08", 
    "1229677", "1126457"), CONSENSUSMAP = c(NA, NA, "1A", NA, "1B", 
    "7B"), SVEVOMAP = c("chr1A", "chr1A", "chr1A", "chr1A", "chr1A", 
    "chr7B")), row.names = c(NA, -6L), class = "data.frame")