Search code examples
perltext-manipulationmerging-datacsv

Perl script for combining 2 files with multiple entries


I have a tab-delimited text file like this:

contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data

And another like this:

contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N

I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:

contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data

I have little scripting experience, but have done this where e.g. "contig11" occurs only once in one of the files, with hashes/keys. But I can't even begin to get my head around to do this! Really appreciate some help or hints as to how to tackle this problem.

EDIT So I have tried ikegami's suggestion (see answers) with this: However, this has produced the output I needed except the GO:100 column onwards ($rest in script???) - any ideas what I'm doing wrong?

#!/usr/bin/env/perl

use warnings;

open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";

my %goterm;

while (<GOTERMS>)
{
    my($id, $rest) = /^(\S++)(,*)/s;
    push @{$goterm{$id}}, $rest;
}

while (my $row2 = <SNPS>)
{
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} })
    {
        print("$row2$rest\n");
    }
}

close GOTERMS;
close SNPS;

Solution

  • Look at your output. It's clearly produced by

    • for each row of the second file,
      • for each row of the first file with the same id,
        • print out the combined rows

    So the question is: How does you find the rows of the first file with the same id as a row of the second file?

    The answer is: You store the rows of the first file in a hash indexed by the row's id.

    my %file1;
    while (<$file1_fh>) {
       my ($id, $rest) = /^(\S++)(.*)/s;
       push @{ $file1{$id} }, $rest;
    }
    

    So the earlier pseudo code resolves to

    while (my $row2 = <$file2_fh>) {
       chomp($row2);
       my ($id) = $row2 =~ /^(\S+)/;
       for my $rest (@{ $file1{$id} }) {
          print("$row2$rest");
       }
    }
    

    #!/usr/bin/env perl
    
    use strict;   
    use warnings;
    
    open(my $GOTERMS, $ARGV[0])
         or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
    open(my $SNPS, $ARGV[1])
         or die("Error opening SNP file \"$ARGV[1]\": $!\n");
    
    my %goterm;
    while (<$GOTERMS>) {
        my ($id, $rest) = /^(\S++)(.*)/s;
        push @{ $goterm{$id} }, $rest;
    }
    
    while (my $row2 = <$SNPS>) {
        chomp($row2);
        my ($id) = $row2 =~ /^(\S+)/;
        for my $rest (@{ $goterm{$id} }) {
            print("$row2$rest");
        }
    }