
Perl: converting from cp1251 to utf8


I'm trying to convert a string from cp1251 to UTF-8.

#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = "\320\300\304\310\323\321 \316\320\300\312\313";
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";

And in this case I get what I need:

# ./convert.pl
converted:
РАДИУС ОРАКЛ

But if I take the string from a command-line argument instead:

#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = $ARGV[0];
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";

Nothing happens: the string is printed unchanged.

# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"
converted:
\320\300\304\310\323\321 \316\320\300\312\313

This is the dump of the first example:

SV = PV(0x1dceb78) at 0x1ded120
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1de7970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
CUR = 12
LEN = 16

And the second:

SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\\320\\300\\304\\310\\323\\321 \\316\\320\\300\\312\\313"\0
CUR = 45
LEN = 48

I've tried this method:

#!/usr/bin/perl -w
use Devel::Peek;
$str = pack 'C*', map oct, $ARGV[0] =~ /\\(\d{3})/g;
print Dump ($str);

# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"

SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\320\300\304\310\323\321\316\320\300\312\313"\0
CUR = 11
LEN = 48

But again, it's not what I need: the space is lost (CUR = 11 instead of 12). Could you help me get the same result as in the first script?


After using this

($str = shift) =~ s/\\([0-7]+)/chr oct $1/eg

as suggested by Borodin, I get this

SV = PVMG(0x13fa7f0) at 0x134d0f0
  REFCNT = 
  FLAGS = (SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x1347970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
  CUR = 12
  LEN = 16
  MAGIC = 0x1358290 
    MG_VIRTUAL = &PL_vtbl_mglob
    MG_TYPE = PERL_MAGIC_regex_global(g)
    MG_LEN = -1

Solution

  • It's not clear exactly what input you're getting, where it comes from, or what you want your output to be. Either way, you shouldn't encode your data into UTF-8 for use within the program, because inside the program you want to deal with characters, not encoded bytes. Just decode it from whatever external encoding is being sent to the program and work with the resulting character string.

    It sounds like the input is Windows-1251 and the output is UTF-8 (?), and I assume the backslashes are a distraction. There are no backslashes in the file or typed on the keyboard, are there? So, changing the base to hex for clarity, your input string is like this

    "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB"
    

    and you want to convert it to a Perl character string, do some stuff with it, and print it to the output. If you're on a Linux machine and you want to explicitly decode it from raw input bytes, then you need to write something like this

    use utf8;
    use strict;
    use warnings;
    use feature 'say';
    
    use open qw/ :std OUT :encoding(UTF-8) /;
    use Encode qw/ decode /;
    
    my $str = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB";
    
    $str = decode('Windows-1251', $str);
    
    say $str;
    

    output

    РАДИУС ОРАКЛ
    

    But that's a contrived situation. The string is actually coming from an input stream, so it's better to set the encoding of the stream and forget about manual decoding. You can use binmode if you're reading from STDIN, like this

    binmode STDIN, ':encoding(Windows-1251)';
    

    and then text input from STDIN will be converted implicitly from Windows-1251-encoded bytes to a character string. Alternatively, if you're opening a file on your own handle, you can put the encoding in the open call

    open my $fh, '<:encoding(Windows-1251)', $file or die $!;
    

    and then you don't need a separate binmode call either.
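To illustrate decoding at the handle rather than by hand, here is a minimal sketch. It assumes an in-memory filehandle opened on a byte buffer (the variable `$bytes` is a stand-in for a real file's contents, not something from the question) so that it runs without any external file:

```perl
use strict;
use warnings;
use Encode qw/ encode /;

# Assumption: these Windows-1251 bytes stand in for the contents of a real file
my $bytes = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB\n";

# Open an in-memory handle on the buffer; the layer decodes as we read
open my $fh, '<:encoding(Windows-1251)', \$bytes or die $!;

my $line = <$fh>;
chomp $line;
close $fh;

# $line is now a character string, so length counts characters, not bytes
my $char_count = length $line;

# Encode to UTF-8 bytes only at the point of output
print encode('UTF-8', $line), "\n";
```

With a real file you would pass the file name instead of `\$bytes`; the rest should stay the same.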

    As I said, I've assumed your output is UTF-8, and in the program above the line

    use open qw/ :std OUT :encoding(UTF-8) /;
    

    makes UTF-8 the default encoding for all output file handles. The :std also applies it to the built-in handles STDOUT and STDERR. If this isn't what you want and you can't figure out how to set it up as you need, then please do ask.
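If the argument really does arrive with literal backslash-octal escapes, as the second dump in the question shows (CUR = 45), the pieces above could be combined roughly as follows. This is only a sketch under that assumption: the helper name `decode_escaped_cp1251` is made up for illustration, and the substitution is a stricter variant of the one quoted in the question.

```perl
use strict;
use warnings;
use Encode qw/ decode encode /;

# Hypothetical helper: turn literal "\NNN" octal escapes into raw bytes,
# then decode those bytes from Windows-1251 into a character string.
sub decode_escaped_cp1251 {
    my ($arg) = @_;
    (my $bytes = $arg) =~ s/\\([0-7]{3})/chr oct $1/eg;
    return decode('Windows-1251', $bytes);
}

# Single quotes keep the backslashes literal, just as the shell passes them.
# The default value is the argument from the question, used when none is given.
my $arg = @ARGV ? $ARGV[0] : '\320\300\304\310\323\321 \316\320\300\312\313';

my $str = decode_escaped_cp1251($arg);

# Encode to UTF-8 bytes only at the point of output
print encode('UTF-8', $str), "\n";
```

Unlike the `pack`/`map` attempt in the question, the substitution leaves everything that is not an escape (including the space) in place, so the decoded string keeps all 12 characters.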