Search code examples
arraysperlhashhash-of-hashes

Perl -> compare same postitions in arrays in a hash of hashes of arrays


I working a on bash script made by a colleague. Since there is no multidimensional data structure in bash, there was lots of workarounds to process the input files. When I saw his script, I thought Perl can handle that much more easily. I did a decent job, I think, at doing the conversion, but now I'm stuck. For those who know, the input files are in Variant Calling Format (VCF). Basically a tab-delimited text file with specific bits of information in specific columns. Here's what the input file looks like:

##fileformat=VCFv4.2
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  MBWGS139
AF2122_NC002945 1012102 .   A   G   1137.77 .   AC=2;AF=1.00;AN=2   GT:AD:DP:GQ:PL  1/1:0,29:29:87:1166,87,0
AF2122_NC002945 103811  .   C   G   241.84  .   AC=2;AF=1.00;AN=2   GT:AD:DP:GQ:PL  1/1:0,6:6:18:270,18,0

First thing I do is to store all the file information in a hash. Multiple files (or samples) are all integrated in that "master" hash, by looping each files line by line. The master hash in constructed like this:

push(@{ $vcfs{$sample}{$CHROM}{$POS} }, $ID, $REF, $ALT, $QUAL, $FILTER, $INFO, $FORMAT, $SAMPLE);

Then I need to filter out the "bad" positions. I'll skip that code here, but the bits of information that I keep are stored in a new hash which is created like this:

push ( @{ $fastas{$sample}{$chrom}{$pos} }, $allele);

Let's say that $allele = $ALT (but it's a bit more complicated that that). So inside %fasta it looks like that:

Sample1
    chrom1
        pos1 = A
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G
Sample2
    chrom1
        pos1 = C
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G

See this as if I had uses Data::Dumper to print it... You can also see that for that part of the code, I ditched a bunch a info.

The last filtering step is to remove the "alleles" that are common to all the samples at a specific position. In the previous example, only pos1 of chrom would be kept. The final hash would look like this:

Sample1
    chrom1
        pos1 = A
Sample2
    chrom1
        pos1 = C

I figured out a way to do it by ditching the position information. I'd need to keep it. I'd really appreciate if someone could help me out here. I can provide my current solution if that would help.

Thanks, Marco


Solution

  • I found 2 ways to get the desired output. But, seeing as there are only 4 alleles to consider, I wonder how it is possible that there may be one that is unique in a larger data set.

    The solution includes a %count hash that can be used to find unique alleles.

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my (%fastas, %counts);
    
    my ($sample, $chr);
    while (<DATA>) {
        if (/^(\S.+)$/) {
            $sample = $1;   
        }
        elsif (/^\t(\S.+)$/) {
            $chr = $1;  
        }
        elsif (/^\t\t(pos\d+)\s=\s([TAGC])$/) {
            my $pos = $1,
            my $allele = $2;
            push @{ $fastas{$sample}{$chr}{$pos} }, $allele;
            $counts{$allele}++; 
        }
        else {
            die "Unknown input line $.: <<$_>>";    
        }
    }
    
    for my $sample (keys %fastas) {
        my $chr_ref = $fastas{$sample};
    
        for my $chr (keys %$chr_ref) {
            my $pos_ref = $chr_ref->{$chr};
    
            for my $pos (keys %$pos_ref) {
                my $allele_ref = $pos_ref->{$pos};
    
                for my $allele (@$allele_ref) {
                    print "$sample $chr $pos $allele\n"
                        if $counts{$allele} == 1;   
                }
            }
        }   
    }
    
    __DATA__
    Sample1
        chrom1
            pos1 = A
            pos2 = T
        chrom2
            pos1 = A
            pos2 = G
    Sample2
        chrom1
            pos1 = C
            pos2 = T
        chrom2
            pos1 = A
            pos2 = G
    

    The second solution makes use of the Deep::Hash::Utils module. It simplifies the code by eliminating the nested loops in the first solution.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Deep::Hash::Utils 'reach';
    
    my (%fastas, %counts);
    
    my ($sample, $chr);
    while (<DATA>) {
        if (/^(\S.+)$/) {
            $sample = $1;   
        }
        elsif (/^\t(\S.+)$/) {
            $chr = $1;  
        }
        elsif (/^\t\t(pos\d+)\s=\s([TAGC])$/) {
            my $pos = $1,
            my $allele = $2;
            push @{ $fastas{$sample}{$chr}{$pos} }, $allele;
            $counts{$allele}++; 
        }
        else {
            die "Unknown input line $.: <<$_>>";    
        }
    }
    
    while (my @list = reach(\%fastas)) {
        my $allele = $list[-1];
        print "@list\n" if $counts{$allele} == 1;
    }
    
    __DATA__
    Sample1
        chrom1
            pos1 = A
            pos2 = T
        chrom2
            pos1 = A
            pos2 = G
    Sample2
        chrom1
            pos1 = C
            pos2 = T
        chrom2
            pos1 = A
            pos2 = G
    

    On the data set you provided, both solutions printed Sample2 chrom1 pos1 C.