Search code examples
pythonpython-2.7encodingutf-8character-encoding

Python 2.7 reading and writing "éèàçê" from utf-8 file


I made this script which removes every trailing whitespace characters and replace all bad french characters by the right ones.

Removing the trailing whitespace characters works but not the part about replacing the french characters.

The file to read/write are encoded in UTF-8 so I added the utf-8 declaration above my script but in the end every bad characters (like \u00e9) are being replaced by litte square.

Any idea why?

script :

# --*-- encoding: utf-8 --*--

import fileinput
import sys

CRLF = "\r\n"

ACCENT_AIGU = "\\u00e9"
ACCENT_GRAVE = "\\u00e8"
C_CEDILLE = "\\u00e7"
A_ACCENTUE = "\\u00e0"
E_CIRCONFLEXE = "\\u00ea"

CURRENT_ENCODING = "utf-8"
#Getting filepath
print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
path = str(raw_input())
path.replace("\\", "/")

#removing trailing whitespace characters
for line in fileinput.FileInput(path, inplace=1):
    if line != CRLF:
        line = line.rstrip()
        print line
        print >>sys.stderr, line
    else:
        print CRLF
        print >>sys.stderr, CRLF
fileinput.close()

#Replacing bad wharacters
for line in fileinput.FileInput(path, inplace=1):
        line = line.decode(CURRENT_ENCODING)
        line = line.replace(ACCENT_AIGU, "é")
        line = line.replace(ACCENT_GRAVE, "è")
        line = line.replace(A_ACCENTUE, "à")
        line = line.replace(E_CIRCONFLEXE, "ê")
        line = line.replace(C_CEDILLE, "ç")
        line.encode(CURRENT_ENCODING)
        sys.stdout.write(line) #avoid CRLF added by print
        print >>sys.stderr, line
fileinput.close()

EDIT

the input file contains this type of text :

 * Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e
 * <code>rechercherTechnicien</code> et retourne la liste repr\u00e9sentant le num\u00e9ro 
 * de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e 
 * th\u00e9orique por se rendre au point d'intervention.
 * 

EDIT2

Final code if someone is interested, the first part replaces the badly encoded caracters, the second part removes all right trailing whitespaces caracters.

# --*-- encoding: iso-8859-1 --*--

import fileinput
import re

CRLF = "\r\n"

print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
path = str(raw_input())
path = path.replace("\\", "/")

def unicodize(seg):
    if re.match(r'\\u[0-9a-f]{4}', seg):
        return seg.decode('unicode-escape')
    return seg.decode('utf-8')

print "Replacing caracter badly encoded"
with open(path,"r") as f:
    content = f.read()
replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))

with open(path, "w") as o:
    o.write(''.join(replaced).encode("utf-8"))

print "Removing trailing whitespaces caracters"
for line in fileinput.FileInput(path, inplace=1):
    if line != CRLF:
        line = line.rstrip()
        print line
    else:
        print CRLF
fileinput.close()

print "Done!"

Solution

  • Not so quick, and mostly dirty, but...

    with open("enc.txt","r") as f:
        content = f.read()
    
    import re
    
    def unicodize(seg):
        if re.match(r'\\u[0-9a-f]{4}', seg):
            return seg.decode('unicode-escape')
    
        return seg.decode('utf-8')
    
    replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))
    
    print(''.join(replaced))
    

    Given that input file (mixing unicode escaped sequences and properly encoded utf-8 text):

     * Cette m\u00e9thode permet d'appeller le service du module de
     * tourn\u00e9e
     * <code>rechercherTechnicien</code> et retourne la liste
     * repr\u00e9sentant le num\u00e9ro 
     * de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien
     * et la dur\u00e9e 
     * th\u00e9orique por se rendre au point d'intervention.
     * 
     * S'il le désire le technicien peut dormir à l'hôtel
    

    Produce that result:

     * Cette méthode permet d'appeller le service du module de
     * tournée
     * <code>rechercherTechnicien</code> et retourne la liste
     * représentant le numéro 
     * de la tournée ainsi que le nom et le prénom du technicien
     * et la durée 
     * théorique por se rendre au point d'intervention.
     * 
     * S'il le désire le technicien peut dormir à l'hôtel