Search code examples
unixbioinformaticsvcf-variant-call-formatbcftools

How to match ID in column in unix?


I am fully aware that similar questions may have been posted, but after searching it seems that the details of our questions are different (or at least I did not manage to find a solution that can be adopted in my case).

I currently have two files: "messyFile" and "wantedID". "messyFile" is of size 80,000,000 X 2,500, whereas "wantedID" is of size 1 x 462. On the 253rd line of "messyFile", there are 2500 IDs. However, all I want is the 462 IDs in the file "wantedID". Assuming that the 462 IDs are a subset of the 2500 IDs, how can I process the file "messyFile" such that it only contains information about the 462 IDs (ie. of size 80,000,000 X 462).

Thank you so much for your patience!

ps: Sorry for the confusion. But yeah, the question can be boiled down to something like this. In the 1st row of "File#1", there are 10 IDs. In the 1st row of "File#2", there are 3 IDs ("File#2" consists of only 1 line). The 3 IDs are a subset of the 10 IDs. Now, I hope to process "File#1" so that it contains only information about the 3 IDs listed in "File#2".

ps2: "messyFile" is a vcf file, whereas "wantedID" can be a text file (I said "can be" because it is small, so I can make almost any type for it)

ps3: "File#1" should look something like this:

sample#1 sample#2 sample#3 sample#4 sample#5
    0        1       0        0        1
    1        1       2        0        2

"File#2" should look something like this:

sample#2 sample#4 sample#5

Desired output should look like this:

sample#2 sample#4 sample#5
   1        0        1
   1        0        2

Solution

  • For parsing VCF format, use bcftools:

    http://samtools.github.io/bcftools/bcftools.html

    Specifically for your task see the view command:

    http://samtools.github.io/bcftools/bcftools.html#view

    Example:

    bcftools view -Ov -S 462sample.list -r chr:pos -o subset.vcf superset.vcf
    

    You will need to get the position of the SNP to specify chr:pos above.

    You can do this using DbSNP:

    http://www.ncbi.nlm.nih.gov/SNP/index.html

    Just make sure to match the genome build to the one used in the VCF file.

    You can also use plink:

    https://www.cog-genomics.org/plink2

    But, PLINK is finicky about duplicated SNPs and other things, so it may complain unless you address these issues.

    I've done what you are attempting in the past using the awk programming language. For your sanity, I recommend using one of the above tools :)