I am new to Perl. I have a requirement to convert each non-ASCII (multi-byte UTF-8) character in a string to a hyphen (-).
Input string - "IVM IST 20150324095652 31610150096 10ÑatÑ25ÑDisco 0000000091"
Expected output - "IVM IST 20150324095652 31610150096 10-at-25-Disco 0000000091".
But the program I have written below reads each UTF-8 character as two separate bytes, so the output is "10--at--25--Disco".
[root@ cdr]# cat ../asciifilter.pl
#!/usr/bin/perl
use strict;
use Encode;
my @chars;
my $character;
my $num;
while(my $row = <>) {
    @chars = split(//,$row);
    foreach $character (@chars) {
        $num = ord($character);
        if($num < 127) {
            print $character;
        } else {
            print "-";
        }
    }
}
Output:
[root@MAVBGL-351L cdr]# echo "IVM IST 20150324095652 31610150096 10ÑatÑ25ÑDisco 0000000091" | ../asciifilter.pl
IVM IST 20150324095652 31610150096 10--at--25--Disco 0000000091
But this particular fourth string column has a fixed length of only 14 characters, so the additional hyphens are causing a problem.
Can someone give me a clue on how to read a UTF-8 character as a single character?
The main thing you need is perl -CSD. With that, the script can be as simple as
perl -CSD -pe 's/[^\x00-\x7F]/-/g'
See man perlrun for a discussion of the options; but briefly, -CS means STDIN, STDOUT, and STDERR are in UTF-8; and -CD means UTF-8 is the default PerlIO layer for both input and output streams. (This script only uses STDIN and STDOUT so the D isn't strictly necessary; but if you only learn one magic incantation, learn -CSD.)
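If you would rather keep a standalone script than pass switches on the command line, the same idea can be sketched inside the file. This is only a minimal sketch: it assumes the input really is valid UTF-8, it uses binmode with the :encoding(UTF-8) layer as a rough stand-in for what -CS does to STDIN and STDOUT, and the helper name hyphenate_non_ascii is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode input and encode output as UTF-8, roughly what -CS does for
# STDIN and STDOUT on the command line.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: replace each non-ASCII character (not each byte)
# with a single hyphen, so the string length is unchanged.
sub hyphenate_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7F]/-/g;
    return $text;
}

# Each line arrives already decoded into characters.
while (my $row = <STDIN>) {
    print hyphenate_non_ascii($row);
}
```

Because the input layer has already turned each multi-byte sequence into one character by the time the substitution runs, "10ÑatÑ25ÑDisco" should come out as "10-at-25-Disco", still 14 characters long.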