I have a tab-delimited text file like this:
contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data
And another like this:
contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N
I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:
contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data
I have little scripting experience, but have done this where e.g. "contig11" occurs only once in one of the files, with hashes/keys. But I can't even begin to get my head around to do this! Really appreciate some help or hints as to how to tackle this problem.
EDIT So I have tried ikegami's suggestion (see answers) with this: However, this has produced the output I needed except the GO:100 column onwards ($rest in script???) - any ideas what I'm doing wrong?
#!/usr/bin/env/perl
use warnings;
open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";
my %goterm;
while (<GOTERMS>)
{
my($id, $rest) = /^(\S++)(,*)/s;
push @{$goterm{$id}}, $rest;
}
while (my $row2 = <SNPS>)
{
chomp($row2);
my ($id) = $row2 =~ /^(\S+)/;
for my $rest (@{ $goterm{$id} })
{
print("$row2$rest\n");
}
}
close GOTERMS;
close SNPS;
Look at your output. It's clearly produced by
So the question is: How does you find the rows of the first file with the same id as a row of the second file?
The answer is: You store the rows of the first file in a hash indexed by the row's id.
my %file1;
while (<$file1_fh>) {
my ($id, $rest) = /^(\S++)(.*)/s;
push @{ $file1{$id} }, $rest;
}
So the earlier pseudo code resolves to
while (my $row2 = <$file2_fh>) {
chomp($row2);
my ($id) = $row2 =~ /^(\S+)/;
for my $rest (@{ $file1{$id} }) {
print("$row2$rest");
}
}
#!/usr/bin/env perl
use strict;
use warnings;
open(my $GOTERMS, $ARGV[0])
or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, $ARGV[1])
or die("Error opening SNP file \"$ARGV[1]\": $!\n");
my %goterm;
while (<$GOTERMS>) {
my ($id, $rest) = /^(\S++)(.*)/s;
push @{ $goterm{$id} }, $rest;
}
while (my $row2 = <$SNPS>) {
chomp($row2);
my ($id) = $row2 =~ /^(\S+)/;
for my $rest (@{ $goterm{$id} }) {
print("$row2$rest");
}
}