regex, bash, sed, iso-8859-1

Search and replace with sed, interpreting a back-reference's content in order to correct corrupted ISO-8859-1 character codes


I have big text files (millions of lines), originally encoded in ISO-8859-1, that somehow got corrupted: the "special" characters (the ones mapped from 0xA0 to 0xFF, beyond the ASCII range) have been replaced with their octal codes.

Example: the 'ü' character (hex: 0xFC) has been replaced by its octal code, written as 4 characters: '\374'.
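For reference, the hex-to-octal mapping can be checked from a shell. This is only a minimal sketch; the variable names `octal` and `byte_hex` are made up for illustration.

```shell
# 0xFC (ü in ISO-8859-1) written in octal.
octal=$(printf '%o' 0xFC)
echo "$octal"        # 374

# The other direction: printf turns the octal escape \374 into a single byte,
# which od then shows in hex.
byte_hex=$(printf '\374' | od -An -tx1 | tr -d ' \n')
echo "$byte_hex"     # fc
```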

I have been trying to write a sed command that processes those octal codes and replaces them with the corresponding original ISO-8859-1 characters, but I'm missing something in how the 4-character code gets interpreted.

So far, my sed command searches for any group of 4 characters of the form \abc, where abc is an octal number between 000 and 377, then tries to replace it with \oabc, which is supposed to produce the ISO-8859-1 encoded character:

paul@paul:~$ sed 's,\\\([0-3][0-7][0-7]\),\\o\1,g' file

Yet that replacement doesn't work: sed does not interpret the \o\1 as an ISO-8859-1 code (as it does when I run sed 's/u/\o374/' file).

If my file contains:

(...) D\374sseldorf (...)

My sed command will replace it with:

(...) D\o374sseldorf (...)

Is there anyone here who could point out where I am wrong?


Solution

  • GNU sed interprets \oxxx when it parses the command, so the escape has to appear literally in the sed command. (Other seds might not interpret \oxxx at all; I don't mean to imply that they will interpret it the way you propose.) As written, the \o is an invalid escape code (it's not followed by an octal number), and is therefore not replaced, while \1 is replaced by the first capture in the match.

    You can do this transformation more easily with a language like Perl which allows you to execute code to produce a replacement:

    perl -pe 's/\\([0-3][0-7][0-7])/chr(oct($1))/eg'