shell sed special-characters tex

Using sed to replace umlauts


I tried the following:

sed -e 's/ü/\\"u/g' filename.tex>filename2.tex

but my terminal doesn't recognise the umlaut, so the command replaces every u with \"u. I know that TeX has packages and what-nots that might solve this problem, but I am interested in a sed way for the moment.


Solution

  • The fundamental problem is that there is a complex interaction between sed, your locale, your terminal, your shell, and the file you are operating on. Here is a list of things to try.

    • If you are lucky, your shell, sed, and the file you are working on agree completely on how the character you are trying to replace is represented. In your case, you already tried that, and it failed.

      sed 's/ü/\\"u/g' filename.tex
      
    • If you are only slightly less lucky, the other parts are fine, and it's just that your sed is not modern enough to grok the character sequence you are trying to replace. A trivial sed script like yours can simply be passed to Perl instead, which is usually more up to date when it comes to character encodings.

      perl -pe 's/ü/\\"u/g' filename.tex
      

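      If the character encoding is UTF-8, you may need to pass the -CSD option to Perl so that it decodes its input and encodes its output as UTF-8, and/or express the character you wish to replace with an escape of some sort. (If you keep a literal ü in the script while using -CSD, you will probably also need -Mutf8 so that Perl decodes the script text itself.) You can say \xfc for a raw hex code (which happens to be ü in Latin-1 and Latin-9), \x{00fc} for a Unicode code point, or even \N{LATIN SMALL LETTER U WITH DIAERESIS}; but note that Unicode has more than one representation for this character: precomposed (U+00FC) or decomposed (u followed by U+0308 COMBINING DIAERESIS), depending on normalization. See also http://perldoc.perl.org/perlunicode.html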

      (For in-place editing, perhaps you want to add the -i option, too.)

    • Finally, you may need to break down and simply figure out the raw bytes of the character you want to replace. A few lines of hex dump of the problematic file should help. After that, Perl should be able to cope, provided it is left matching raw bytes rather than decoded characters (that is, without -C options or encoding layers). If, say, you find that the problematic sequence is 0xFF 0x03, then perl -pe 's/\xff\x03/\\"u/g' filename.tex should work.