Tags: python, decode, iso-8859-1, non-ascii-characters

python - Converting a non-ascii character to a specific character



Hello everyone, and thanks for your time!

I have ISO-8859-1 HTML files as input, with HTML entities in place of non-ASCII characters, which is neat. The one exception is a single character: œ (it shows up as code point 009C once decoded, in case it doesn't display). I want to convert it to "oe" to get rid of it.

I already tried iconv -f iso-8859-1 -t ascii//translit

but it strips the problematic character and doesn't put anything in its place.

I work with Python 2.7 and I tried several things with decode, encode, and codecs, but I'm not getting anywhere. Here is my code at this point:

import codecs
import os

i = 0
for name in os.listdir(dir_in):
    i += 1
    infile = codecs.open(dir_in + name, "r", "iso-8859-1")
    out = codecs.open(dir_out + str(i) + ".html", "w", "utf-8")
    for line in infile:
        # at this point the type of line is "unicode"
        line = line.decode("iso-8859-1", errors="replace")  # this call raises the error
        out.write(line)
    infile.close()
    out.close()

(I have trouble making the indentation display correctly, but I assure you this part is fine.) I get an "'ascii' codec can't encode character u'\x9c'" error. I'm not sure if I'm using decode appropriately.

I also tried :

line=unicode(line)

which gets rid of the character without a replacement (which is what it's supposed to do, I guess)

line=unicode(line, errors="replace")

which gives me "TypeError: decoding Unicode is not supported". I suppose those two didn't work because I'm not supposed to pass an already-unicode object to unicode().
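
For reference, the decode error and the TypeError appear to share one cause: codecs.open already returns unicode objects, so there is nothing left to decode. In Python 2, calling decode on a unicode object first encodes it implicitly with the default ASCII codec, and that hidden step is what fails on u"\x9c". A minimal sketch that performs that step explicitly (the sample string is made up for illustration; it runs on Python 2.7 and 3):

```python
# Hypothetical sample: what reading the bytes b"c\x9cur" as iso-8859-1 yields.
line = u"c\x9cur"

try:
    # Python 2 performs this encode implicitly inside line.decode(...).
    line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__)
```

Dropping the extra decode call avoids the implicit encode altogether.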

If you have a simple method to do it in bash or Perl, I'm interested too, but I can't use Python 3, as it's not supported by the server that will have to run this.

Thanks a lot!


Solution

  • Could you just replace the character in question before trying to write it?

    line = line.replace(u"\x9c", u"oe")
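
Put together with the loop from the question, the suggestion might look like the sketch below. It is only a sketch under some assumptions: the helper names (replace_oe, convert_file, convert_dir) are made up for illustration, it uses io.open rather than the question's codecs.open (both decode on read in Python 2.7), and it sorts the directory listing so the output numbering is deterministic. The replacement also covers U+0153, in case the files turn out to be Windows-1252 and are read with source_encoding="cp1252", where byte 0x9C decodes to a real œ.

```python
import io
import os


def replace_oe(text):
    """Replace a stray U+009C (or a real U+0153 ligature) with "oe"."""
    return text.replace(u"\x9c", u"oe").replace(u"\u0153", u"oe")


def convert_file(path_in, path_out, source_encoding="iso-8859-1"):
    # io.open decodes on read and encodes on write, so no manual
    # decode() call is needed; newline="" keeps line endings untouched.
    with io.open(path_in, "r", encoding=source_encoding, newline="") as src:
        with io.open(path_out, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(replace_oe(line))


def convert_dir(dir_in, dir_out):
    # Output files are numbered 1.html, 2.html, ... as in the question.
    for i, name in enumerate(sorted(os.listdir(dir_in)), start=1):
        convert_file(os.path.join(dir_in, name),
                     os.path.join(dir_out, str(i) + ".html"))
```

Calling convert_dir(dir_in, dir_out) with the two directories from the question would then process every file. Since the remaining non-ASCII characters are already HTML entities, the UTF-8 output should be plain ASCII in practice.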