I'm trying to interpolate double-quoted strings as defined in the PO file format.
After some testing, I determined that the only recognized escape sequences are \a, \b, \f, \n, \r, \t, \v, \\, \", \xh..., \o, \oo, and \ooo (with h and o standing for hexadecimal and octal digits respectively).
Now suppose that you have the following file.po (the double-quoted string of the msgstr line is the one of interest):
msgid "test"
msgstr "BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \x01009 Y2 \1712"
I wrote a function to interpolate those strings, leveraging eval:
sub uncstring {
    local $_ = substr((shift), 1, -1); # trim surrounding double-quotes
    s/
        \\ ( [abfnrtv\\"] | x [[:xdigit:]]+ | [0-7]{1,3} )
    /
        eval "\"\\$1\""
    /xge;
    return $_;
}
When I use it with file.po:
perl -MB -lne '
BEGIN{ sub uncstring { ... } }
print B::cstring(uncstring($1)) if /^msgstr (".*")/;
' file.po
I get (after escaping the output with B::cstring):
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? v BSOL \\ QUOT \" HEX? \001009 Y2 y2"
In comparison, when I use the GetText utility msgexec:
msgexec -i file.po 0 | perl -MB -lp0e '$_ = B::cstring($_);'
I get (after escaping the output with B::cstring):
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \t Y2 y2"
As you can see, the two outputs differ for the escape sequences marked VT? and HEX?.
How can I fix my uncstring function so that it interprets the escape sequences the way GetText does?
There were two problems with using eval to decode the PO strings:
Perl doesn't know about \v (thanks @choroba), so eval converts it to v instead of a literal VT.
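GetText reads \x escape sequences of any length but keeps only the least significant byte (much as C compilers commonly do), whereas Perl's \x without braces reads at most two hex digits. For example, GetText treats \x01009 as equivalent to \x09 and translates it to a literal HT, while Perl produces \x01 followed by the text 009.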
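Both points can be checked directly in Perl. The snippet below is only a small sketch: the variable names are made up for illustration, and it assumes nothing beyond core Perl. It evaluates the two problematic escapes the way uncstring does, then shows the byte masking that reproduces the msgexec result.

```perl
use strict;
use warnings;
no warnings 'misc';    # silence "Unrecognized escape \v passed through"

# Perl has no \v escape in double-quoted strings: the backslash is dropped.
my $vt = eval q{"\v"};          # the letter "v", not a vertical tab

# Without braces, Perl's \x consumes at most two hex digits.
my $hex = eval q{"\x01009"};    # chr(1) followed by the text "009"

# Keeping only the least significant byte of the whole hex number
# gives the horizontal tab that msgexec printed for \x01009.
my $low = chr(0xFF & hex "01009");

# Print the ordinals of each result, joined by dots.
printf "%vd\n", $_ for $vt, $hex, $low;
```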
Here's the final code of a function that decodes PO strings. It uses a hash to store the translations of the single-char escape sequences, and the hex and oct functions to convert the hexadecimal and octal escape sequences into integers; each number is then reduced to its least significant byte before being converted to a literal character.
use 5.014;    # "use 5.10" would mean 5.100; the /r flag needs Perl 5.14
use feature qw{state switch};

sub po_unqqbackslash {
    # Associate the "alpha char" of a single-char escape sequence
    # to the "decoded value" of that escape sequence:
    state $decode = {
        "a"  => "\a",
        "b"  => "\b",
        "f"  => "\f",
        "n"  => "\n",
        "r"  => "\r",
        "t"  => "\t",
        "v"  => "\x0B",
        "\\" => "\\",
        "\"" => "\"",
    };

    # Run a global substitution on the string argument (trimmed
    # of its leading and trailing double-quotes).
    return substr(shift, 1, -1) =~ s{
        \\ (?:
            (?<chr> [abfnrtv\\"] )    | # single-char escape sequence
            x (?<hex> [[:xdigit:]]+ ) | # hexadecimal escape sequence
            (?<oct> [0-7]{1,3} )        # octal escape sequence
        )
    } [
        # NOTE: the captured group name is in fact the only key of %+
        my ($group) = %+;
        given($group) {
            when('chr') { $decode->{$+{$group}} }
            when('hex') { chr(0xFF & hex $+{$group}) }
            when('oct') { chr(0xFF & oct $+{$group}) }
        }
    ]xager;
}
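The function above relies on given/when, which Perl has flagged as experimental since 5.18 and which emits warnings on many versions. As a hedged alternative, here is a sketch of the same decoding that tests the named captures directly instead. The names po_decode and %decode are hypothetical and chosen only for this example; the regex and the byte masking are the same as above.

```perl
use strict;
use warnings;
use 5.014;    # named captures need 5.10, the /r flag needs 5.14

# Decoded values of the single-character escape sequences.
my %decode = (
    'a'  => "\a",
    'b'  => "\b",
    'f'  => "\f",
    'n'  => "\n",
    'r'  => "\r",
    't'  => "\t",
    'v'  => "\x0B",    # Perl has no \v escape, so spell out the VT byte
    '\\' => "\\",
    '"'  => '"',
);

# Hypothetical name; takes a PO string including its surrounding quotes.
sub po_decode {
    my ($quoted) = @_;
    return substr($quoted, 1, -1) =~ s{
        \\ (?:
            (?<chr> [abfnrtv\\"] )    |   # single-char escape sequence
            x (?<hex> [[:xdigit:]]+ ) |   # hexadecimal escape sequence
            (?<oct> [0-7]{1,3} )          # octal escape sequence
        )
    }{
          defined( $+{chr} ) ? $decode{ $+{chr} }
        : defined( $+{hex} ) ? chr( 0xFF & hex $+{hex} )
        :                      chr( 0xFF & oct $+{oct} )
    }xager;
}

# Single quotes keep the backslashes literal, as they would be in a PO file.
print po_decode('"HEX? \x01009 Y2 \1712"'), "\n";
```

With the msgstr line from file.po as input, this should give the same bytes as msgexec: a VT for \v and an HT for \x01009.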
note: A special thanks to @ikegami and @briandfoy for their tips