Tags: python, c, perl, utf-8, string-literals

Good Perl style: How to convert UTF-8 C string literals to \xXX sequences


[Python people: My question is at the very end :-)]

I want to use UTF-8 within C string literals for readability and easy maintenance. However, this is not universally portable. My solution is to create a file foo.c.in which a small Perl script converts to foo.c, so that foo.c contains \xXX escape sequences instead of bytes greater than or equal to 0x80.

For simplicity, I assume that a C string starts and ends on the same line.

This is the Perl code I've created. If a byte >= 0x80 is found, the original string is also emitted as a comment.

use strict;
use warnings;

binmode STDIN, ':raw';
binmode STDOUT, ':raw';


sub utf8_to_esc
{
  my $string = shift;
  my $oldstring = $string;
  my $count = 0;
  $string =~ s/([\x80-\xFF])/$count++; sprintf("\\x%02X", ord($1))/eg;
  $string = '"' . $string . '"';
  $string .= " /* " . $oldstring . " */" if $count;
  return $string;
}

while (<>)
{
  s/"((?:[^"\\]++|\\.)*+)"/utf8_to_esc($1)/eg;
  print;
}

For example, the input

"fööbär"

gets converted to

"f\xC3\xB6\xC3\xB6b\xC3\xA4r" /* fööbär */
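
As a cross-check of this output: the escapes are just the UTF-8 bytes of each non-ASCII character, written in hex. A minimal Python sketch of that mapping follows. The function name `escape_non_ascii` is made up for illustration, and the input is assumed to use precomposed characters (ö as U+00F6, ä as U+00E4).

```python
def escape_non_ascii(text: str) -> str:
    """Replace every UTF-8 byte >= 0x80 with a \\xXX escape."""
    out = []
    for byte in text.encode("utf-8"):
        # Bytes below 0x80 are plain ASCII and are kept verbatim.
        out.append(chr(byte) if byte < 0x80 else f"\\x{byte:02X}")
    return "".join(out)


print(escape_non_ascii("fööbär"))  # prints f\xC3\xB6\xC3\xB6b\xC3\xA4r
```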

Finally, my question: I'm not very good at Perl, and I wonder whether the code can be rewritten in a more elegant (or more 'Perlish') way. I would also appreciate a pointer to similar code written in Python.


Solution

    1. I think it's best if you don't use :raw. You are processing text, so you should properly decode and encode. That will be far less error-prone, and it will allow your parser to use predefined character classes if you so desire.

    2. You parse as if you expect backslashes in the literal, but then you completely ignore them when you escape. Because of that, a backslash directly in front of a non-ASCII character produces output like "...\\xC3\xA3...", where the backslash now escapes the backslash of the first \x sequence. Working with decoded text will also help here.

    So forget "perlish"; let's actually fix the bugs.

    use strict;
    use warnings;
    use open ':std', ':locale';   # decode input and encode output per the locale
    
    sub convert_char {
       my ($s) = @_;
       utf8::encode($s);
       $s = uc unpack 'H*', $s;
       $s =~ s/\G(..)/\\x$1/sg;
       return $s;
    }
    
    sub convert_literal {
       my $orig = my $s = substr($_[0], 1, -1);
    
       my $safe          = '\x20-\x7E';          # ASCII printables and space
       my $safe_no_slash = '\x20-\x5B\x5D-\x7E'; # ASCII printables and space, no \
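       $s =~ s{
          (?: \\? ( [^$safe] )
          |   ( (?: [$safe_no_slash] | \\[$safe] )+ )
          )
       }{
          defined($1) ? convert_char($1) : $2
       }egx;
    
       # s///g returns the number of matches, which also counts the runs
       # that are copied through unchanged, so compare with the original.
       my $changed = $s ne $orig;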
    
       # XXX Assumes $orig doesn't contain "*/"
       return qq{"$s"} . ( $changed ? " /* $orig */" : '' );
    }
    
    while (<>) {
       s/(" (?:[^"\\]++|\\.)*+ ")/ convert_literal($1) /segx;
       print;
    }
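
For the Python half of the question, here is a rough port of the same approach, offered as a sketch rather than a drop-in tool. It works on decoded text, keeps escapes of printable ASCII characters as they are, and turns every other character into the \xXX form of its UTF-8 bytes. The names `LITERAL`, `TOKEN`, `convert_char`, `convert_literal` and `convert_line` are my own, and it keeps the question's assumption that a literal starts and ends on the same line.

```python
import re

# A C string literal on one line: quotes around ordinary characters
# or backslash escapes.
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Inside a literal: either an escape of a printable ASCII character
# (group 1, kept as is), or an optional backslash followed by a
# character outside 0x20..0x7E (group 2, converted).
TOKEN = re.compile(r'(\\[\x20-\x7E])|\\?([^\x20-\x7E])')


def convert_char(ch: str) -> str:
    """Return the UTF-8 bytes of ch as \\xXX escapes."""
    return "".join(f"\\x{b:02X}" for b in ch.encode("utf-8"))


def convert_literal(match: re.Match) -> str:
    """Rewrite one literal; add the original as a comment if it changed."""
    orig = match.group(1)
    new = TOKEN.sub(lambda m: m.group(1) or convert_char(m.group(2)), orig)
    # XXX Like the Perl version, assumes orig does not contain "*/".
    comment = f" /* {orig} */" if new != orig else ""
    return f'"{new}"{comment}'


def convert_line(line: str) -> str:
    """Convert every string literal found in one line of C source."""
    return LITERAL.sub(convert_literal, line)


print(convert_line('puts("fööbär");'))
```

To use it as a filter, apply `convert_line` to each line of a file opened with `encoding="utf-8"` and write the results out. One caveat applies to the Perl and Python versions alike: C treats every hex digit after `\x` as part of the escape, so in the sample output `\xB6b` is read as one out-of-range escape instead of the byte B6 followed by the letter b. Splitting the literal at that point (`"\xB6" "b"`) or emitting three-digit octal escapes avoids this.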