Tags: perl, utf-8, ascii, non-ascii-characters

Perl - Convert utf-8 char to hyphen - read utf-8 as single char


I am new to Perl. I need to replace every UTF-8 (non-ASCII) character in a string with a hyphen (-).

Input string    - "IVM IST   20150324095652 31610150096     10ÑatÑ25ÑDisco 0000000091"
Expected output - "IVM IST   20150324095652 31610150096     10-at-25-Disco 0000000091".

But the program I have written (below) reads each UTF-8 character as two separate bytes, so the output is "10--at--25--Disco":

[root@ cdr]# cat ../asciifilter.pl
#!/usr/bin/perl
use strict;
use Encode;
my @chars;
my $character;
my $num;
while(my $row = <>) {
  @chars = split(//,$row);

  foreach $character (@chars) {
    $num  = ord($character);
    if($num < 127) { 
      print $character;
    } else { 
      print "-";
    }
  }
}

Output:

  [root@MAVBGL-351L cdr]# echo "IVM IST   20150324095652 31610150096     10ÑatÑ25ÑDisco 0000000091" | ../asciifilter.pl
  IVM IST   20150324095652 31610150096     10--at--25--Disco 0000000091

But the 4th column is a fixed-length string of exactly 14 characters, so the additional hyphens cause a problem.

Can someone give me a clue on how to read a UTF-8 character as a single character?


Solution

  • The main thing you need is perl -CSD. With that, the script can be as simple as

    perl -CSD -pe 's/[^\x00-\x7F]/-/g'
    

    See man perlrun for a discussion of the options; but briefly, -CS means STDIN, STDOUT, and STDERR are in UTF-8; and -CD means UTF-8 is the default PerlIO layer for both input and output streams. (This script only uses STDIN and STDOUT so the D isn't strictly necessary; but if you only learn one magic incantation, learn -CSD.)
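
If the filter has to live in a script file where the -C switch is awkward to pass, roughly the same effect can be had by decoding the input bytes explicitly before matching. The following is only a minimal sketch under that assumption: the helper name replace_non_ascii is hypothetical, and the sample field is the 4th column from the question written out as raw bytes ("Ñ" is 0xC3 0x91 in UTF-8).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes a UTF-8 encoded byte string and returns it
# with every non-ASCII character replaced by a single hyphen.
sub replace_non_ascii {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> characters
    $chars =~ s/[^\x00-\x7F]/-/g;          # one hyphen per character
    return $chars;                         # result is pure ASCII
}

# The 4th column from the question, spelled out as raw UTF-8 bytes.
my $field = "10\xC3\x91at\xC3\x9125\xC3\x91Disco";

print length($field), "\n";                      # 17 (bytes, not characters)
print replace_non_ascii($field), "\n";           # 10-at-25-Disco
print length(replace_non_ascii($field)), "\n";   # 14
```

The first length is 17 because, without decoding, Perl sees each "Ñ" as two bytes, which is exactly why the original script printed two hyphens per character. After decoding, each "Ñ" is one character, so the field keeps its fixed width of 14.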