I working a on bash script made by a colleague. Since there is no multidimensional data structure in bash, there was lots of workarounds to process the input files. When I saw his script, I thought Perl can handle that much more easily. I did a decent job, I think, at doing the conversion, but now I'm stuck. For those who know, the input files are in Variant Calling Format (VCF). Basically a tab-delimited text file with specific bits of information in specific columns. Here's what the input file looks like:
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MBWGS139
AF2122_NC002945 1012102 . A G 1137.77 . AC=2;AF=1.00;AN=2 GT:AD:DP:GQ:PL 1/1:0,29:29:87:1166,87,0
AF2122_NC002945 103811 . C G 241.84 . AC=2;AF=1.00;AN=2 GT:AD:DP:GQ:PL 1/1:0,6:6:18:270,18,0
First thing I do is to store all the file information in a hash. Multiple files (or samples) are all integrated in that "master" hash, by looping each files line by line. The master hash in constructed like this:
push(@{ $vcfs{$sample}{$CHROM}{$POS} }, $ID, $REF, $ALT, $QUAL, $FILTER, $INFO, $FORMAT, $SAMPLE);
Then I need to filter out the "bad" positions. I'll skip that code here, but the bits of information that I keep are stored in a new hash which is created like this:
push ( @{ $fastas{$sample}{$chrom}{$pos} }, $allele);
Let's say that $allele = $ALT (but it's a bit more complicated that that). So inside %fasta it looks like that:
Sample1
chrom1
pos1 = A
pos2 = T
chrom2
pos1 = A
pos2 = G
Sample2
chrom1
pos1 = C
pos2 = T
chrom2
pos1 = A
pos2 = G
See this as if I had uses Data::Dumper to print it... You can also see that for that part of the code, I ditched a bunch a info.
The last filtering step is to remove the "alleles" that are common to all the samples at a specific position. In the previous example, only pos1 of chrom would be kept. The final hash would look like this:
Sample1
chrom1
pos1 = A
Sample2
chrom1
pos1 = C
I figured out a way to do it by ditching the position information. I'd need to keep it. I'd really appreciate if someone could help me out here. I can provide my current solution if that would help.
Thanks, Marco
I found 2 ways to get the desired output. But, seeing as there are only 4 alleles to consider, I wonder how it is possible that there may be one that is unique in a larger data set.
The solution includes a %count
hash that can be used to find unique alleles.
#!/usr/bin/perl
use strict;
use warnings;
my (%fastas, %counts);
my ($sample, $chr);
while (<DATA>) {
if (/^(\S.+)$/) {
$sample = $1;
}
elsif (/^\t(\S.+)$/) {
$chr = $1;
}
elsif (/^\t\t(pos\d+)\s=\s([TAGC])$/) {
my $pos = $1,
my $allele = $2;
push @{ $fastas{$sample}{$chr}{$pos} }, $allele;
$counts{$allele}++;
}
else {
die "Unknown input line $.: <<$_>>";
}
}
for my $sample (keys %fastas) {
my $chr_ref = $fastas{$sample};
for my $chr (keys %$chr_ref) {
my $pos_ref = $chr_ref->{$chr};
for my $pos (keys %$pos_ref) {
my $allele_ref = $pos_ref->{$pos};
for my $allele (@$allele_ref) {
print "$sample $chr $pos $allele\n"
if $counts{$allele} == 1;
}
}
}
}
__DATA__
Sample1
chrom1
pos1 = A
pos2 = T
chrom2
pos1 = A
pos2 = G
Sample2
chrom1
pos1 = C
pos2 = T
chrom2
pos1 = A
pos2 = G
The second solution makes use of the Deep::Hash::Utils module. It simplifies the code by eliminating the nested loops in the first solution.
#!/usr/bin/perl
use strict;
use warnings;
use Deep::Hash::Utils 'reach';
my (%fastas, %counts);
my ($sample, $chr);
while (<DATA>) {
if (/^(\S.+)$/) {
$sample = $1;
}
elsif (/^\t(\S.+)$/) {
$chr = $1;
}
elsif (/^\t\t(pos\d+)\s=\s([TAGC])$/) {
my $pos = $1,
my $allele = $2;
push @{ $fastas{$sample}{$chr}{$pos} }, $allele;
$counts{$allele}++;
}
else {
die "Unknown input line $.: <<$_>>";
}
}
while (my @list = reach(\%fastas)) {
my $allele = $list[-1];
print "@list\n" if $counts{$allele} == 1;
}
__DATA__
Sample1
chrom1
pos1 = A
pos2 = T
chrom2
pos1 = A
pos2 = G
Sample2
chrom1
pos1 = C
pos2 = T
chrom2
pos1 = A
pos2 = G
On the data set you provided, both solutions printed Sample2 chrom1 pos1 C
.