Search code examples
rduplicatesmultiple-columns

removing specific duplicates in R


I have data that looks like this ( sample of it ) :

snp_id=c("chr16-54319851-C-A","chr16-54319851-C-A","chr16-54319851-C-A","chr16-54319851-C-A","chr10-100003732-A-G","")
AF_total=c("-","-",0.1,0.1,"-","-")
df=data.frame(snp_id,AF_total)

It can be seen that there are duplicates. What I wish to do is delete from this data frame the records that have "-" in the AF_total but only if the value in snp_id column is duplicated ( in this case it is) and there are value besides "-" (0.1) and after that delete the extra row with 0.1 value. For the other snp_id value I have duplicate of "-" so i would like to live only one "-" so in the end the data will look like this:

snp id                AF_total
chr16-54319851-C-A       0.1     
chr10-100003732-A-G       -

I looked at some QA like this :Remove duplicates in one column based on another column but sadly they didnt help. would be happy for help , thank you


Solution

  • df %>%
      # grouped by snp_id
      group_by(snp_id) %>%
      # keep either (
      #  all the rows if all the AF_total values are "-"
      #  OR
      #  if there are some non-"-" values keep only the non-"-" values
      filter(all(AF_total == "-") | AF_total != "-") %>%
      ungroup() %>%
      # of the rows that remain, de-duplicate
      distinct(snp_id, AF_total, .keep_all = TRUE) %>%
      # drop any rows (like your last sample row) that have a blank snp_id
      filter(snp_id != "")
    # # A tibble: 2 × 2
    #   snp_id              AF_total
    #   <chr>               <chr>   
    # 1 chr16-54319851-C-A  0.1     
    # 2 chr10-100003732-A-G -