I have data that looks like this ( sample of it ) :
snp_id=c("chr16-54319851-C-A","chr16-54319851-C-A","chr16-54319851-C-A","chr16-54319851-C-A","chr10-100003732-A-G","")
AF_total=c("-","-",0.1,0.1,"-","-")
df=data.frame(snp_id,AF_total)
It can be seen that there are duplicates. What I wish to do is delete from this data frame the records that have "-" in the AF_total but only if the value in snp_id column is duplicated ( in this case it is) and there are value besides "-" (0.1) and after that delete the extra row with 0.1 value. For the other snp_id value I have duplicate of "-" so i would like to live only one "-" so in the end the data will look like this:
snp id AF_total
chr16-54319851-C-A 0.1
chr10-100003732-A-G -
I looked at some QA like this :Remove duplicates in one column based on another column but sadly they didnt help. would be happy for help , thank you
df %>%
# grouped by snp_id
group_by(snp_id) %>%
# keep either (
# all the rows if all the AF_total values are "-"
# OR
# if there are some non-"-" values keep only the non-"-" values
filter(all(AF_total == "-") | AF_total != "-") %>%
ungroup() %>%
# of the rows that remain, de-duplicate
distinct(snp_id, AF_total, .keep_all = TRUE) %>%
# drop any rows (like your last sample row) that have a blank snp_id
filter(snp_id != "")
# # A tibble: 2 × 2
# snp_id AF_total
# <chr> <chr>
# 1 chr16-54319851-C-A 0.1
# 2 chr10-100003732-A-G -