Search code examples
bashawkbioinformaticsgenome

Extracting certain locus from multiple samples from text file


After profiling STR locus in a population, the output gave me 122 files each of which contains about unique 800,000 locus. There are 2 examples of my files:

SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02035 chr2 1489653 (aatg)8 (aatg)11 4
HG02035 chr2 68011947 (tcta)11 (tcta)11 4
HG02035 chr2 218014855 (ggaa)16 (ggaa)16 4
HG02035 chr3 45540739 (tcta)15 (tcta)16 43
SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02040 chr2 1489653 (aatg)8 (aatg)8 4
HG02040 chr2 68011947 (tcta)10 (tcta)10 4
HG02040 chr2 218014855 (ggaa)21 (ggaa)21 4
HG02040 chr3 45540739 (tcta)17 (tcta)17 4

I've been trying to extract variants for each of 800,000 STR locus. I expect the output should be like this for chromosome 1 at position of 230769616:

HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02072 chr1 230769616 (tcta)10 (tcta)15 4
HG02121 chr1 230769616 (tcta)2 (tcta)2 4
HG02131 chr1 230769616 (tcta)16 (tcta)16 4
HG02513 chr1 230769616 (tcta)14 (tcta)14 4

I tried this command: awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf

It worked but it take lots of time to create large number of files for each locus.

I am struggling to find an optimal solution to solve this.


Solution

  • You aren't closing the output files as you go so if you have a large number of them then your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).

    Assuming you want to get a separate output file for every $2+$3 pair, you should be using the following with any awk:

    tail -n +2 -q *.vcf | sort -k2,3 |
    awk '
        { cur = $2 "_" $3 ".locus.tsv" }
        cur != out { close(out); out=cur }
        { print > out }
    '
    

    If you want to have the header line present in every output file then tweak that to:

    { head -n 1 file1.vcf; tail -n +2 -q *.vcf | sort -k2,3; } |
    awk '
        NR==1 { hdr=$0; next }
        { cur = $2 "_" $3 ".locus.tsv" }
        cur != out { close(out); out=cur; print hdr > out }
        { print > out }
    '