
Alter a file using information from another file


I want to alter the names in a phylip file using information from another file. The phylip is just one continuous string of information, and the names I want to alter (e.g. aaaaaaabyd) are embedded in it. Like so

((aaaaaaabyd:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,aaaaaaaafj:0.47991503739434709930):0.06859184769990583908,((aaaaaaaabk:0.09244297511609228524,aaaaaaaete:0.12568841555837687030):0.28431

(there are no newlines)

The names within are like aaaaaaaabk.

The other file holds the names to change to, one pair per line, like so:

aaaaaaaabk;Ciona savignyi
aaaaaaaete;Homo sapiens
aaaaaaaafj;Cryptosporidium hominis
aaaaaaaaad;Strongylocentrotus purpuratus
aaaaaaabyd;Theileria parva
aaaaaaaaaf;Plasmodium vivax

I have tried numerous things but this is the closest I got. The problem is it does it for one and doesn't print out the rest of the phylip file. I need to get to ((Theileria parva:0.23400159127856412500, etc.

open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $csv,  "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $new_tree, "> raxml_tree.phy");

# Declare variables
my $find;
my $replace;
my $digest;

# put the file of the tree into string variable
my $string = <$tree>;

# open csv file
while (my $line = <$csv>) {

    # aaaaaaaaaa;Ciona savignyi

    if ($line =~ m/(\w+)\;+(\w+\s+\w*)/) {
        $find    = $1;
        $replace = $2;
        $string =~ s/$find/$replace/g;
    }
}
print $new_tree "$string";

close $tree;
close $csv;
close $new_tree;

Solution

  • Some guidelines on your own code

    • The problem is almost certainly that you are opening the same file $ARGV[0] twice. Presumably one of them should be $ARGV[1]

    • You must always use strict and use warnings at the top of every Perl program you write (there is very little point in declaring your variables unless use strict is in place) and declare all your variables with my as close as possible to their first point of use. It is bad form to declare all your variables in a block at the start, because it makes them all effectively global, and you lose most of the advantages of declaring lexical variables

    • You should use the three-parameter form of open, and it is a good idea to put the name of the file in the die string so that you can see which one failed. So

      open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
      

      becomes

      open my $tree, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
      
    • You should look for simpler solutions rather than apply regex methods every time. $line =~ m/(\w+)\;+(\w+\s+\w*)/ is much tidier written as chomp followed by split /;/

    • You shouldn't use double-quotes around variables when you want just the value of the variable, so print $new_tree "$string" should be print $new_tree $string
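
    As a rough illustration of the chomp and split point above, here is a minimal sketch. The subroutine name parse_name_line is made up for this example and is not part of your program. Note that your original pattern (\w+\s+\w*) requires whitespace after the first word, so a one-word name would not match at all, and anything after the second word would be dropped. Splitting on the semicolon avoids both

```perl
use strict;
use warnings;

# Hypothetical helper: split one "short_name;Full name" line into its two fields.
sub parse_name_line {
    my ($line) = @_;
    chomp $line;

    # A limit of 2 keeps any later semicolons inside the replacement text.
    my ($find, $replace) = split /;/, $line, 2;
    return ($find, $replace);
}

my ($find, $replace) = parse_name_line("aaaaaaaabk;Ciona savignyi\n");
print "$find => $replace\n";    # prints "aaaaaaaabk => Ciona savignyi"
```

    If the file may have Windows line endings, chomp alone leaves a trailing carriage return on the name, which is why the program below strips all trailing whitespace instead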

    Rather than trying to use the data from the other file (please try to use useful names for items in your question, as it's tough to know what to call them when writing a solution) it is best to build a hash that contains all the translations

    This program will do as you ask. It builds a regex consisting of an alternation of all the hash keys, and then replaces every occurrence of that pattern with the corresponding name. Only those names that appear in your sample other file are translated: the others are left as they are

    use strict;
    use warnings;
    use 5.014;  # For non-destructive substitution
    use autodie;
    
    my %names;
    open my $fh, '<', 'other_file.txt';
    while ( <$fh> ) {
      my ($k, $v) = split /;/, s/\s+\z//r;
      $names{$k} = $v;
    }
    
    open $fh, '<', 'phylip.txt';
    my $data = <$fh>;
    close $fh;
    
    my $re = join '|', sort { length $b <=> length $a } keys %names;
    $re = qr/(?:$re)/;
    $data =~ s/\b($re)\b/$names{$1}/g;
    
    print $data;
    

    output

    ((Theileria parva:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,Cryptosporidium hominis:0.47991503739434709930):0.06859184769990583908,((Ciona savignyi:0.09244297511609228524,Homo sapiens:0.12568841555837687030):0.28431
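
    If you want to reuse the idea, the hash-and-alternation step can be wrapped in a subroutine. This is only a sketch: the name rename_taxa is hypothetical, and the quotemeta call is an extra precaution that assumes a name could one day contain regex metacharacters, which your sample names do not

```perl
use strict;
use warnings;

# Hypothetical helper: replace every known short name in $text with its full name.
sub rename_taxa {
    my ($text, $names) = @_;
    return $text unless %$names;    # an empty alternation would match everywhere

    # Longest keys first, so a key that is a prefix of another cannot win.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$names;

    # \b on each side keeps a key from matching inside a longer identifier.
    $text =~ s/\b($alternation)\b/$names->{$1}/g;
    return $text;
}

my %names = (
    aaaaaaabyd => 'Theileria parva',
    aaaaaaaabk => 'Ciona savignyi',
);
print rename_taxa('((aaaaaaabyd:0.234,(aaaaaaaaxv:0.449,aaaaaaaabk:0.092):0.06):0.1', \%names), "\n";
# prints ((Theileria parva:0.234,(aaaaaaaaxv:0.449,Ciona savignyi:0.092):0.06):0.1
```

    Because the whole string is scanned once, text that has already been substituted is never examined again, so a replacement can never be rewritten by a later key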
    

    Update

    Here is a revised version of your own program with the above points accounted for and the bugs fixed

    use strict;
    use warnings;
    
    open my $tree_fh, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
    my $string = <$tree_fh>;
    close $tree_fh;
    
    open my $csv_fh,  '<', $ARGV[1] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
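    while ( <$csv_fh> ) {
        chomp;
        my ($find, $replace) = split /;/;
        next unless defined $replace;          # skip blank or malformed lines
        $string =~ s/\Q$find\E/$replace/g;     # \Q...\E matches $find literally
    }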
    close $csv_fh;
    
    open my $new_tree_fh, '>', 'raxml_tree.phy' or die qq{Failed to open "raxml_tree.phy" for output: $!\n};
    print $new_tree_fh $string;
    close $new_tree_fh;
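
    Since the original bug was passing the same file to both open calls, a small check on the arguments can catch that mistake early. This is a sketch only: the subroutine name tree_and_names_files and the example file names are made up here, and in a real script you would pass @ARGV

```perl
use strict;
use warnings;

# Hypothetical helper: return the two file names, or die with a helpful message.
sub tree_and_names_files {
    my @args = @_;
    die "Usage: $0 tree_file names_file\n" unless @args == 2;
    die "Tree file and names file must be different files\n"
        if $args[0] eq $args[1];
    return @args;
}

# Example file names are placeholders; use tree_and_names_files(@ARGV) in practice.
my ($tree_file, $names_file) = tree_and_names_files('tree.phy', 'names.csv');
print "tree: $tree_file, names: $names_file\n";
```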