
Replacing German umlauts in an ISO 8859-15 file on a UTF-8 system


I have a bunch of CSV files that I read and plot with Python and pandas.

To add some more information about each file (or rather, the data it contains) to my plots, I parse the file headers to extract various things (location of the measurement point, type of measurement, etc.).

The problem is that the files are in German and thus contain a lot of umlauts (ü, ö, ä). I can read and understand them perfectly fine, but my script can't.

So I simply want to replace them with their two-character representations (ü=ue, …), so that I don't have to worry about using things like u'Ümlautstring' or \xfcstring in Python.

sed -i 's/\ä/ae/g' myfile.csv

should do the trick, according to Google, but it doesn't work.

With some further research, I found the issue, but no solution:

My CSV files are encoded in ISO 8859-15, but my locale is LANG=de_DE.UTF-8, which, as far as I understand it, means that sed searches for ü in its UTF-8 form, which it will not find in an ISO 8859-15 file.
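
A quick check of that mismatch, as a minimal sketch assuming Python 3 (the sample bytes for "München" are made up for illustration): the same character becomes two bytes in UTF-8 but a single, different byte in ISO 8859-15, so a search for the UTF-8 form cannot match.

```python
# "\u00fc" is the character ü, written as an escape so the source stays ASCII.
utf8_bytes = "\u00fc".encode("utf-8")         # two bytes in UTF-8
latin9_bytes = "\u00fc".encode("iso-8859-15")  # one byte in ISO 8859-15

print(utf8_bytes)    # b'\xc3\xbc'
print(latin9_bytes)  # b'\xfc'

# Hypothetical ISO 8859-15 content: "München" as it would sit in the file.
sample = b"M\xfcnchen"
print(utf8_bytes in sample)    # False: the UTF-8 pattern is not there
print(latin9_bytes in sample)  # True
```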

So what do I have to tell sed to find my umlauts?

Most things I have found so far suggest Perl, but that is not really an option.


Solution

  • You can use the LC_* environment variables to prevent sed from doing any UTF-8 interpretation, and \x escape sequences to specify the umlaut characters by their hex values in ISO-8859-15. Long story short,

    LC_ALL=C sed 's/\xc4/Ae/g;s/\xd6/Oe/g;s/\xdc/Ue/g;s/\xe4/ae/g;s/\xf6/oe/g;s/\xfc/ue/g;s/\xdf/ss/g' filename
    

    should work for all of ÄÖÜäöüß, which I'm guessing are the ones you care about.
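
Since the files end up in Python anyway, the same replacement can also be done there instead of in sed. This is not the sed approach above but an alternative sketch, assuming Python 3; the names `UMLAUT_MAP`, `transliterate_bytes` and `transliterate_file` are made up for illustration. It decodes the bytes as ISO-8859-15 first, so the current locale does not matter, and then maps the same seven characters.

```python
# Map each German umlaut (and ß) to its ASCII transliteration.
UMLAUT_MAP = str.maketrans({
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # Ä Ö Ü
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # ä ö ü
    "\u00df": "ss",                                  # ß
})


def transliterate_bytes(raw: bytes, encoding: str = "iso-8859-15") -> str:
    """Decode raw file bytes and replace umlauts with ASCII digraphs."""
    return raw.decode(encoding).translate(UMLAUT_MAP)


def transliterate_file(src: str, dst: str, encoding: str = "iso-8859-15") -> None:
    """Read src in the given encoding and write the transliterated text to dst."""
    with open(src, "rb") as f:
        raw = f.read()
    # newline="" keeps the original line endings untouched on write.
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(transliterate_bytes(raw, encoding))


print(transliterate_bytes(b"M\xfcnchen, Stra\xdfe, \xc4rger"))
# Muenchen, Strasse, Aerger
```

If only the reading step is the problem, pandas' `read_csv` also accepts an `encoding` argument (for example `encoding="iso-8859-15"`), which may make the replacement unnecessary.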