Given pairs of strings like this:
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";
# Note the string can be longer than this.
I would like to find the position and character in $s1
where it differs from $s2.
In this case the answer would be:
#String Position 0-based
# First col = Base in S1
# Second col = Base in S2
# Third col = Position in S1 where they differ
C G 1
G - 4
I can achieve that easily with substr(), but it is horribly slow.
Typically I need to compare millions of such pairs.
Is there a fast way to achieve that?
Stringwise ^ is your friend:
use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";
my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
    # $-[0] is the offset of the current non-nul (i.e. differing) character
    print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}
The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:
"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
"  "
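As a quick sanity check of the breakdown above, the following sketch prints the result of the same XOR and, for the strings from the question, dumps the mask as hex so the otherwise invisible nul bytes can be seen. It assumes plain ASCII (byte) strings; the variable names are only illustrative.

```perl
use strict;
use warnings;

# "A" ^ "a" and "B" ^ "b" both give chr(32), so the result is two spaces.
my $xor = "AB" ^ "ab";
print length($xor), "\n";                                   # 2
print $xor eq "  " ? "two spaces\n" : "something else\n";   # two spaces

# Mask for the strings in the question, shown as hex bytes.
# 00 means "same character"; anything else marks a difference.
my $mask = "ACTGGA" ^ "AGTG-A";
print unpack("H*", $mask), "\n";                            # 000400006a00
```

The non-zero bytes sit at offsets 1 and 4, which are exactly the positions listed in the question.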
The useful feature here is that a nul character ("\0") occurs if and only if the two strings have the same character at a given position. So ^ can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (each indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each difference is given by $-[0], the offset of the start of the last successful match.
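For comparing many pairs, the same idea could be wrapped in a small subroutine. The following is only a sketch: diff_positions is a hypothetical helper name, not part of the answer above, and it assumes both strings are byte strings of equal length (as with aligned sequences), dying otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [char_in_s1, char_in_s2, position]
# array reference per mismatch, with 0-based positions.
sub diff_positions {
    my ($s1, $s2) = @_;

    # Assumption: the inputs are aligned, so their lengths must match.
    die "strings differ in length\n" unless length($s1) == length($s2);

    my $mask = $s1 ^ $s2;    # "\0" wherever the two characters agree
    my @diffs;
    while ($mask =~ /[^\0]/g) {
        my $pos = $-[0];     # offset where the last match started
        push @diffs, [ substr($s1, $pos, 1), substr($s2, $pos, 1), $pos ];
    }
    return @diffs;
}

# Usage with the strings from the question.
print join(' ', @$_), "\n" for diff_positions("ACTGGA", "AGTG-A");
# C G 1
# G - 4
```

The XOR itself scans both strings in one operation, so the substr() calls run only once per actual difference rather than once per character.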