Search code examples
rfiltercode-generationgenomevcf-variant-call-format

select uniquely mapped genes to SNP


I have a vcf and want to select 100 genes, and for each gene, to select one SNP? Technically if we consider one gene it has many rsid mapped to it. For my analysis I need to select 100 genes and for each gene select only one SNP for it and ignore others and have a final vcf file with only 100 genes uniquely mapped to 1 rsid only. Is there any package which can do this?

22 16050075 rs587697622 A G 100.0 PASS
AC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_A F=0.001;AA=.|||;VT=SNP;ANN=G|intergenic_region|MODIFIER|CHR_START-LA16c-4G1.3|CHR_START-ENSG00000233866|intergenic_region|CHR_START-ENSG00000233866|||n.16050 075A>G|||||| GT

This is what I am trying to do: library(vcfR)

# Import VCF
my.vcf <- read.vcfR('my.vcf.gz')
# Combine CHROM thru FILTER cols + INFO cols
my.vcf.df <- cbind(as.data.frame(getFIX(my.vcf)), INFO2df(my.vcf))
data_new <- my.vcf.df[!duplicated(my.vcf.df[ ,"ANN"]),]

This is how my dataset looks like: enter image description here

As you can see the ANN column has several ENSG00000233866 values. I want to filter the df based on ANN values that is gene ID. So for ENSG00000233866 i just want to select only one rsid or rs values and ignore others. I want to do this similarly for other genes.


Solution

  • Consider str.split of the pipe-delimited column in ANN for the gene which is about the fourth pipe. Then run a sequential count by the gene grouping with ave. Then subset group number for the 1. Please not below uses the new pipe, |> in R 4.1.0+:

    data_new <- within(
      data_new, {
        GENE <- gsub(
           "CHR_START-", "", 
           sapply(strsplit(ANN, "|"), "[", 4)
        )
        GENE_GROUP_SEQ <- ave(1:nrow(data_new), GENE, FUN=seq_along)
      }
    ) |> subset(
      GENE %in% unique(GENE)[1:100] &  # SUBSET FOR FIRST 100 GENES
      GENE_GROUP_SEQ == 1              # SUBSET FOR FIRST OF EACH GENE GROUP
    ) |> `row.names<-`(NULL)           # RESET ROW NAMES
    

    For R versions less that 4.1.0, break out the pipe to separate calls:

    data_new <- within(
      data_new, {
        GENE <- gsub(
           "CHR_START-", "", 
           sapply(strsplit(ANN, "|"), "[", 4)
        )
        GENE_GROUP_SEQ <- ave(1:nrow(data_new), GENE, FUN=seq_along)
      }
    )
    
    data_sub <- subset(
      data_new,
      GENE %in% unique(GENE)[1:100] &           # SUBSET FOR FIRST 100 GENES
      GENE_GROUP_SEQ == 1                       # SUBSET FOR FIRST OF EACH GENE GROUP
    )
    
    dat_sub <- `row.names<-`(data_sub, NULL)    # RESET ROW NAMES