I'm trying to interpolate the double-quoted strings as defined in the PO file format.
After some testing I could determine that the only recognized escape sequences are \a, \b, \f, \n, \r, \t, \v, \\, \", \xh..., \o, \oo, and \ooo (with h and o standing for hexadecimal and octal digits respectively).
Now suppose that you have the following file.po (the double-quoted string of the msgstr line is the one of interest):
msgid "test"
msgstr "BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \x01009 Y2 \1712"
I wrote a function to interpolate those strings, leveraging eval:
sub uncstring {
    local $_ = substr((shift), 1, -1); # trim surrounding double-quotes
    s/
        \\ ( [abfnrtv\\"] | x [[:xdigit:]]+ | [0-7]{1,3} )
    /
        eval "\"\\$1\""
    /xge;
    return $_;
}
When I use it with file.po:
perl -MB -lne '
BEGIN{ sub uncstring { ... } }
print B::cstring(uncstring($1)) if /^msgstr (".*")/;
' file.po
I get (after escaping the output with B::cstring):
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? v BSOL \\ QUOT \" HEX? \001009 Y2 y2"
In comparison, when I use the GetText utility msgexec:
msgexec -i file.po 0 | perl -MB -lp0e '$_ = B::cstring($_);'
I get (after escaping the output with B::cstring):
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \t Y2 y2"
As you can see, the two outputs differ for the escape sequences marked by VT? and HEX?.
How can I fix my uncstring function for it to interpret the escape sequences like GetText does?
There were two problems with using eval for un-escaping the PO strings:
Perl doesn't know about \v (thanks @choroba), so eval converts it to a plain v instead of a literal VT.
GetText reads \x escape sequences of any length, but it only keeps the least significant byte (like in standard C); e.g. \x01009 is equivalent to \x09 and is translated to a literal HT.
Here's a (hopefully) working function that decodes PO strings. It uses a hash to store the translations of the single-character escape sequences, and the hex and oct functions to convert the hexadecimal and octal escape sequences into integers; the resulting number is then reduced to its least significant byte before being translated to a literal character.
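# Translations of the single-character escape sequences.
my %tr = (
    "a"  => "\a",
    "b"  => "\b",
    "f"  => "\f",
    "n"  => "\n",
    "r"  => "\r",
    "t"  => "\t",
    "v"  => "\x0B",    # Perl has no \v escape
    "\\" => "\\",
    "\"" => "\""
);
# Decode a PO string, surrounding double quotes included.
# The r flag (non-destructive substitution) needs Perl 5.14 or later.
sub po_unqqbackslash {
    return substr(shift, 1, -1) =~
        s/
            \\ (?: ([abfnrtv\\"]) | x ([[:xdigit:]]+) | ([0-7]{1,3}) )
        /
            defined $1 ? $tr{$1} : chr(0xFF & (defined $2 ? hex $2 : oct $3))
        /xager
    ;
}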
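One edge case worth keeping in mind: as far as I know, hex warns (and eventually loses precision) when the hexadecimal string is too long for a native integer, which can happen since \x sequences may have any length. Because only the low byte matters, a possible variation is to look at the last two hex digits only. The sketch below is an assumption-laden rewrite, not a tested replacement; the names po_unescape and %po_escape are hypothetical:

```perl
use strict;
use warnings;

# Single-character escapes; \v is spelled \x0B because Perl has no \v escape.
my %po_escape = (
    a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
    t => "\t", v => "\x0B", "\\" => "\\", '"' => '"',
);

# Hypothetical variant: decode a PO string, surrounding quotes included.
sub po_unescape {
    my $s = substr(shift, 1, -1);    # trim the surrounding double quotes
    $s =~ s{
        \\ (?: ([abfnrtv\\"]) | x ([[:xdigit:]]+) | ([0-7]{1,3}) )
    }{
          defined $1 ? $po_escape{$1}
          # low byte of a hex escape = its last two hex digits
        : defined $2 ? chr(hex(length $2 > 2 ? substr($2, -2) : $2))
          # octal escapes can exceed one byte, so mask them as well
        :              chr(oct($3) & 0xFF)
    }xge;
    return $s;
}

# Sample input: a 22-digit hex escape that would overflow a 64-bit integer.
my $po  = '"VT \v HEX \x01009 LONG \x' . ('0' x 20) . '41 Y2 \1712"';
my $out = po_unescape($po);
print $out eq "VT \x0B HEX \t LONG A Y2 y2" ? "ok\n" : "not ok\n";
```

Testing with defined rather than truthiness also keeps \x0 on the hexadecimal branch, since the captured "0" is a false value in Perl.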
Note: special thanks to @ikegami for his tips.