I've got a small programme that processes lists of blast hits and checks for overlap between them. It does this by looking up each blast result (used as a hash key) in hashes built from each blast list.
Each blast input file is passed as an element of @ARGV and processed in the same way. Depending on what I'm trying to achieve, I might want to compare 2, 3 or 4 blast lists for gene overlap. I want to know how I can write the basic processing block as a subroutine that I can call once for each argument in @ARGV, however many there are.
For example, the below works fine if I input 2 blast lists:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
if ($#ARGV != 1){
die "Usage: intersect.pl <de gene list 1> <de gene list 2>\n";
}
my $input1 = $ARGV[0];
open my $blast1, '<', $input1 or die $!;
my $results1 = 0;
my (@blast1ID, @blast1_info, @split);
while (<$blast1>) {
chomp;
@split = split('\t');
push @blast1_info, $split[0];
push @blast1ID, $split[2];
$results1++;
}
print "$results1 blast hits in $input1\n";
my %blast1;
push @{$blast1{$blast1ID[$_]} }, [ $blast1_info[$_] ] for 0 .. $#blast1ID;
#print Dumper (\%blast1);
my $input2 = $ARGV[1];
open my $blast2, '<', $input2 or die $!;
my $results2 = 0;
my (@blast2ID, @blast2_info);
while (<$blast2>) {
chomp;
@split = split('\t');
push @blast2_info, $split[0];
push @blast2ID, $split[2];
$results2++;
}
my %blast2;
push @{$blast2{$blast2ID[$_]} }, [ $blast2_info[$_] ] for 0 .. $#blast2ID;
#print Dumper (\%blast2);
print "$results2 blast hits in $input2\n";
But I would like to be able to adjust it to cater for 3 or 4 blast list inputs. I imagine a subroutine would work best for this, one that is invoked for each input and might look something like this:
sub process {
my $input$i = $ARGV[$i-1];
open my $blast$i, '<', $input[$i] or die $!;
my $results$i = 0;
my (@blast$iID, @blast$i_info, @split);
while (<$blast$i>) {
chomp;
@split = split('\t');
push @blast$i_info, $split[0];
push @blast$iID, $split[2];
$results$i++;
}
print "$results$i blast hits in $input$i\n";
print Dumper (\@blast$i_info);
print Dumper (\@blast$iID);
# Call sub 'process for every ARGV...
&process for 0 .. $#ARGV;
UPDATE:
I've removed the hash part for the last snippet.
The resultant data structure should be:
4 blast hits in <$input$i>
$VAR1 = [
'TCONS_00001332(XLOC_000827),_4.60257:9.53943,_Change:1.05146,_p:0.03605,_q:0.998852',
'TCONS_00001348(XLOC_000833),_0.569771:6.50403,_Change:3.51288,_p:0.0331,_q:0.998852',
'TCONS_00001355(XLOC_000837),_10.8634:24.3785,_Change:1.16613,_p:0.001,_q:0.998852',
'TCONS_00002204(XLOC_001374),_0.316322:5.32111,_Change:4.07226,_p:0.00485,_q:0.998852',
];
$VAR1 = [
'gi|50418055|gb|BC078036.1|_Xenopus_laevis_cDNA_clone_MGC:82763_IMAGE:5156829,_complete_cds',
'gi|283799550|emb|FN550108.1|_Xenopus_(Silurana)_tropicalis_mRNA_for_alpha-2,3-sialyltransferase_ST3Gal_V_(st3gal5_gene)',
'gi|147903202|ref|NM_001097651.1|_Xenopus_laevis_forkhead_box_I4,_gene_1_(foxi4.1),_mRNA',
'gi|2598062|emb|AJ001730.1|_Xenopus_laevis_mRNA_for_Xsox17-alpha_protein',
];
And the input:
TCONS_00001332(XLOC_000827),_4.60257:9.53943,_Change:1.05146,_p:0.03605,_q:0.998852 0.0 gi|50418055|gb|BC078036.1|_Xenopus_laevis_cDNA_clone_MGC:82763_IMAGE:5156829,_complete_cds
TCONS_00001348(XLOC_000833),_0.569771:6.50403,_Change:3.51288,_p:0.0331,_q:0.998852 0.0 gi|283799550|emb|FN550108.1|_Xenopus_(Silurana)_tropicalis_mRNA_for_alpha-2,3-sialyltransferase_ST3Gal_V_(st3gal5_gene)
TCONS_00001355(XLOC_000837),_10.8634:24.3785,_Change:1.16613,_p:0.001,_q:0.998852 0.0 gi|147903202|ref|NM_001097651.1|_Xenopus_laevis_forkhead_box_I4,_gene_1_(foxi4.1),_mRNA
TCONS_00002204(XLOC_001374),_0.316322:5.32111,_Change:4.07226,_p:0.00485,_q:0.998852 0.0 gi|2598062|emb|AJ001730.1|_Xenopus_laevis_mRNA_for_Xsox17-alpha_protein
You can't inject a variable value in the middle of a variable name. (Well, you can, but you shouldn't. Even then, you can't use array indexing in the middle of the name.)
These names aren't valid:
@blast[$i]_info
@blast[$i]_ID
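You need to move the index to the end. (To read a single element you'd write $blast_info[$i] with the $ sigil; the @ form shown here is a one-element slice.)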
@blast_info[$i]
@blast_ID[$i]
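As a rough sketch of what indexed access could look like in practice: each slot of the array holds an array reference for one input list, so no numbered variable names are needed. The sample values and the loop bounds are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: one inner array per input list.
my (@blast_info, @blast_ID);

for my $i (0 .. 1) {
    # Autovivification creates the inner array on the first push.
    push @{ $blast_info[$i] }, "info_${i}_a", "info_${i}_b";
    push @{ $blast_ID[$i] },   "id_${i}_a",   "id_${i}_b";
}

# A single element uses the $ sigil; a whole inner list is dereferenced with @{ }.
print "First ID of list 2: $blast_ID[1][0]\n";
print "Hits in list 1: ", scalar @{ $blast_info[0] }, "\n";
```

The index now selects which input list you mean, which is what the numbered names were trying to express.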
That said, I'd get rid of the arrays completely and use a hash instead.
Your second code snippet doesn't show a call to your subroutine. Unless it's explicitly called it will never run and your program will do nothing. I'd modify the process sub to take a single argument and call it for each element of @ARGV, e.g.
process($_) foreach @ARGV;
Here's how I'd write your program:
use strict;
use warnings;
use Data::Dumper;
my @blast;
push @blast, process($_) foreach @ARGV;
print Dumper(\@blast);
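sub process {
    my $file = shift;
    open my $fh, '<', $file or die "Can't read file '$file' [$!]\n";
    my %data;
    while (<$fh>) {
        chomp;
        # Tab-separated columns, as in the question: info, (unused), hit ID.
        my ($info, undef, $id) = split /\t/;
        # Keyed by hit ID; a repeated ID overwrites the earlier entry.
        $data{$id} = $info;
    }
    return \%data;
}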
It isn't quite clear what your resulting data structure should look like. (I took my best guess.) I recommend reading perlreftut to gain a better basic understanding of references and using them to build data structures in Perl.
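Since the stated goal is to find entries shared between the input lists, one possible way to use an array of hash references like @blast is to count, for each key, how many of the hashes contain it. This is only a sketch under assumptions: the sub name shared_keys and the sample data are invented here and are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the keys that occur in every hash passed in.
sub shared_keys {
    my @hashes = @_;
    my %seen;
    # Keys are unique within a hash, so each hash adds at most 1 per key.
    for my $h (@hashes) {
        $seen{$_}++ for keys %$h;
    }
    # Keep only the keys counted once for every hash.
    my @common = grep { $seen{$_} == @hashes } sort keys %seen;
    return @common;
}

# Made-up data standing in for the hash references returned by process().
my @blast = (
    { geneA => 'x1', geneB => 'x2', geneC => 'x3' },
    { geneB => 'y1', geneC => 'y2' },
    { geneC => 'z1', geneB => 'z2', geneD => 'z3' },
);

my @common = shared_keys(@blast);
print "Shared by all ", scalar @blast, " lists: @common\n";
```

Because the helper takes a list of hash references, the same call should work whether 2, 3 or 4 files were given on the command line.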