Tags: python, csv, python-2.7, utf-8, encode

Encoding text to file / reading encoded text correctly to eradicate the Â symbol?


Basically, I am getting weird symbols in my student data, as you can see: MAIN Â£1.00 when it should show MAIN £1.00

Below is a snippet of my code that scrapes a website for certain student information (their student discounts) and eventually writes it to a file.

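# -*- coding: utf-8 -*-
import re

# `main` (the parsed page element) and `rpr_data` (a list) are defined
# earlier in the script and are not shown in this snippet.
totals = main.find_all('p')
for total in totals:
    if total.find(text=re.compile("Main:")):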
        total = total.get_text()
        if u"Main £" in total:
            pull1 = re.search(r'(MAIN) (\D\w+\D\d+)', total)
            pull2 = re.search(r'(MAINER) (\D\w+\D\d+)', total)
            if pull1:
                rpr_data.append(pull1.group(0).title())
                print pull1.group(0).title()
            if pull2:
                rpr_data.append(pull2.group(0).title())
                print pull2.group(0).title()
with open('RPR.txt','w') as rpr_file:
    rpr_file.write('\n'.join(rpr_data).encode("UTF-8"))

When I try to reuse this data in the script from "Matching three variables from textfile to csv and writing variables to the csv on matched rows", the symbol comes back when it writes to the CSV, even though the data in the text file has no weird Â symbol.

How can I permanently eradicate this Â symbol correctly?


Solution

  • Getting extra Â characters before various Western European characters is almost always a sign of interpreting UTF-8 as Latin-1 (or cp1252 or some other "extended Latin-1" charset).*

    That could be you receiving UTF-8 input and trying to process it as Latin-1, or you generating UTF-8 output that someone else is trying to process as Latin-1.


    If you're seeing these in the output file, the most likely explanation is that your code is doing everything right at every step and generating a perfectly good UTF-8 file, and then you're viewing that file on a Windows machine whose ANSI code page is 1252, in a program like Notepad that defaults to the ANSI code page.

    If that's it, you have two options:

    1. Don't do that. View the file as UTF-8. You can tell Notepad to open a file as UTF-8 instead of the default. Or you can use a different editor/viewer.

    2. If you want the file to be viewable as cp1252, or as "whatever the ANSI code page is on this machine", save it that way: for example, change the last line to use encode("cp1252").
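
As a rough sketch of both options, assuming sample data in place of the scraped values and a hypothetical helper named `write_lines` (neither is part of the original script): `io.open` takes an `encoding` argument and encodes on write, and the `utf-8-sig` codec prepends a byte order mark that Notepad uses to recognise UTF-8.

```python
import io
import os
import tempfile


def write_lines(path, lines, encoding):
    """Hypothetical helper: write unicode strings to `path`, one per line."""
    # io.open encodes on write, so no manual .encode() call is needed.
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(u'\n'.join(lines))


rpr_data = [u'Main \u00a31.00']  # sample data standing in for the scraped values
out_dir = tempfile.mkdtemp()

# Option 1: keep UTF-8, but add a BOM so Notepad detects the encoding.
utf8_path = os.path.join(out_dir, 'RPR_utf8.txt')
write_lines(utf8_path, rpr_data, 'utf-8-sig')

# Option 2: write cp1252 so tools that assume that code page display it correctly.
cp1252_path = os.path.join(out_dir, 'RPR_cp1252.txt')
write_lines(cp1252_path, rpr_data, 'cp1252')

# Reading back with the matching encoding returns the original text.
with io.open(utf8_path, encoding='utf-8-sig') as handle:
    utf8_text = handle.read()
with io.open(cp1252_path, encoding='cp1252') as handle:
    cp1252_text = handle.read()
```

Note that option 2 raises a UnicodeEncodeError if the data ever contains a character that cp1252 cannot represent, which is one reason to prefer option 1 when you control how the file is viewed.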


    If you're seeing them in the print statements, the most likely explanation is that your code is doing everything right, but your terminal is a Windows command prompt that's again set to code page 1252. See "Python, Unicode, and the Windows console" and "Windows cmd encoding change causes Python crash" for all the different things that can go wrong here and how to work around them.
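
One possible way to check what the console expects is sketched below. It is not necessarily what the linked answers recommend, and the helper name `encodable_for` is hypothetical. `sys.stdout.encoding` reports the encoding Python uses for the terminal; it can be `None` when output is piped under Python 2, so the sketch falls back to ASCII.

```python
import sys


def encodable_for(text, encoding):
    """Hypothetical helper: replace characters `encoding` cannot represent with '?'."""
    return text.encode(encoding, 'replace').decode(encoding)


# sys.stdout.encoding can be None (e.g. piped output on Python 2); fall back to ASCII.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print('console encoding: %s' % console_encoding)

price = u'Main \u00a31.00'
# Only characters the console can represent reach print, so it does not raise.
print(encodable_for(price, console_encoding))
```

If the reported encoding is cp1252 and the output still looks wrong, the mismatch is more likely in the terminal settings than in the script.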


    * You can see this from a quick line of Python: u'\u00a3'.encode('utf-8').decode('latin-1') == u'\u00c2\u00a3'. That u'\u00c2' is Â. Going the other way can never cause this problem: u'\u00a3'.encode('latin-1').decode('utf-8') will instead raise a UnicodeDecodeError.
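
The footnote's one-liners can be expanded into a short script. The final repair step is an addition here, not part of the answer above, and it only works when the text was mis-decoded exactly once as Latin-1.

```python
pound = u'\u00a3'  # POUND SIGN

# UTF-8 encodes the pound sign as two bytes: 0xC2 0xA3.
utf8_bytes = pound.encode('utf-8')

# Reading those two bytes as Latin-1 yields two characters, the first being the stray one.
garbled = utf8_bytes.decode('latin-1')

# The reverse mistake fails loudly instead of producing a stray character.
try:
    pound.encode('latin-1').decode('utf-8')
    reverse_error = None
except UnicodeDecodeError as exc:
    reverse_error = exc

# If the text was mis-decoded exactly once, undoing the wrong decode recovers it.
repaired = garbled.encode('latin-1').decode('utf-8')
```

Repairing after the fact is a workaround. Decoding the input with the right encoding in the first place avoids the problem.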