I have a bunch of CSV files that I read and plot with Python and pandas.
To add some more information about each file (or rather, the data it is about) to my plots, I parse the headers to extract various things from them (location of the measurement point, type of measurement, etc.).
The problem is that the files are in German and thus contain a lot of umlauts (ü, ö, ä). I can read and understand them perfectly fine, but my script can't.
So I want to simply replace them with their valid two-character representations (ü=ue, …), so that I don't have to worry about using things like u'Ümlautstring' or \xfcstring in Python.
sed -i 's/\ä/ae/g' myfile.csv
should do the trick, according to Google, but it doesn't work.
With some further research, I found the issue, but no solution:
My CSV files are encoded in ISO 8859-15, but my locale is LANG=de_DE.UTF-8, which, as far as I understand it, means that sed searches for ü in its UTF-8 form, which it will not find in ISO 8859-15.
So what do I have to tell sed to find my umlauts?
Most things I have found so far suggest Perl, but that is not really an option.
You can use the LC_* environment variables to prevent sed from doing any UTF-8 interpretation, and \x escape sequences (a GNU sed extension) to specify the umlaut characters by their hex value in ISO-8859-15. Long story short,
LC_ALL=C sed 's/\xc4/Ae/g;s/\xd6/Oe/g;s/\xdc/Ue/g;s/\xe4/ae/g;s/\xf6/oe/g;s/\xfc/ue/g;s/\xdf/ss/g' filename
should work for all of ÄÖÜäöüß, which I'm guessing are the ones you care about.
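
For comparison, here is a minimal Python sketch of the same byte-level replacement, in case doing it inside the script turns out to be more convenient. It assumes the data really is ISO-8859-15; the function name transliterate_umlauts and the sample string are made up for illustration and do not come from the question.

```python
# Byte-level equivalent of the sed command above: map each ISO-8859-15
# umlaut byte to its ASCII transliteration. Assumes ISO-8859-15 input.
UMLAUT_BYTES = {
    b"\xc4": b"Ae",  # Ä
    b"\xd6": b"Oe",  # Ö
    b"\xdc": b"Ue",  # Ü
    b"\xe4": b"ae",  # ä
    b"\xf6": b"oe",  # ö
    b"\xfc": b"ue",  # ü
    b"\xdf": b"ss",  # ß
}


def transliterate_umlauts(raw: bytes) -> bytes:
    """Replace ISO-8859-15 umlaut bytes in raw data with ASCII digraphs."""
    for src, dst in UMLAUT_BYTES.items():
        raw = raw.replace(src, dst)
    return raw


# The same character is a different byte sequence in each encoding, which is
# why a UTF-8 pattern never matches ISO-8859-15 data.
assert "ü".encode("iso-8859-15") == b"\xfc"
assert "ü".encode("utf-8") == b"\xc3\xbc"

# Hypothetical header text, encoded the way the CSV files are.
sample = "Messstelle: Köln-Süd; Größe".encode("iso-8859-15")
print(transliterate_umlauts(sample).decode("ascii"))
# prints: Messstelle: Koeln-Sued; Groesse
```

Working on bytes rather than decoded strings mirrors what LC_ALL=C does for sed: no encoding is interpreted, so the single-byte patterns match exactly.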