
How to convert cp1252 to UTF-8 when exporting a CSV file using Python


I get a Unicode error when I try to export a CSV file (web scraping; I'm using BeautifulSoup and have imported both csv and BeautifulSoup). The code was written on Mac/Linux, where UTF-8 is well supported, but I'm using Windows. The error is:

> UnicodeEncodeError Traceback (most recent call last) in () 71
> 'ranking_title': ranking_title, ---> 72 'ranking_category':
> ranking_category}) 73
> 
> ~\Anaconda3\lib\csv.py in writerow(self, rowdict) 154 def
> writerow(self, rowdict): --> 155 return
> self.writer.writerow(self._dict_to_list(rowdict)) 156
> 
> ~\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final) 18
> def encode(self, input, final=False): ---> 19 return
> codecs.charmap_encode(input,self.errors,encoding_table)[0] 20
> 
> UnicodeEncodeError: 'charmap' codec can't encode characters in
> position 299-309: character maps to <undefined>

The original code that works for Mac is:

import urllib.request

def get_page(url):
    # Fetch the page and decode the raw bytes as UTF-8 text
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('utf8')
    return mainpage

I tried to decode the response as cp1252 and re-encode it as UTF-8 at the beginning of the worksheet:

def get_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('cp1252').encode('utf8')
    return mainpage

But it doesn't work. Please help.


Solution

  • The UnicodeEncodeError you are facing occurs when you write the data to the CSV output file. As the error message tells us, Python uses a "charmap" codec which doesn't support the characters contained in your data. This usually happens when you open a file without specifying the encoding parameter on a Windows machine.

    In the attached code document (comment link), snippet no. 10, we can see that this is the case. You wrote:

    with open('wongnai.csv', 'w', newline='') as record:
        fieldnames = ...
    

    In this case, Python uses a platform-dependent default encoding, which on Windows is usually a legacy 8-bit code page (cp1252 here, as the traceback shows). Specify a codec that supports all of Unicode, and writing the file should succeed:

    with open('wongnai.csv', 'w', newline='', encoding='utf16') as record:
        fieldnames = ...
    

    You can also use "utf8" or "utf32" instead of "utf16". UTF-8 is very popular for saving files in Unix environments and on the Internet, but if you plan to open the CSV file with Excel later on, you may have trouble getting the application to display the data properly. A more Windows-proof (but technically non-standard) solution is to use "utf-8-sig", which writes a byte order mark (BOM) at the beginning of the file to help Windows programs recognize it as UTF-8.
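
To see the difference without running the scraper, here is a minimal sketch. The sample rows, the `write_rows` helper, and the temporary output directory are assumptions made up for illustration; only the field names and the `wongnai.csv` file name come from the question. It forces cp1252 explicitly so the failure can be reproduced on any platform, then writes the same rows with "utf-8-sig".

```python
import csv
import locale
import os
import tempfile

# Hypothetical sample rows; the Thai text stands in for scraped data
# that cp1252 cannot represent.
rows = [
    {"ranking_title": "ร้านอาหารยอดนิยม", "ranking_category": "อาหารไทย"},
    {"ranking_title": "Café Crème", "ranking_category": "Dessert"},
]
fieldnames = ["ranking_title", "ranking_category"]

# This is the encoding open() typically falls back to when none is given.
print("default encoding:", locale.getpreferredencoding(False))


def write_rows(path, encoding):
    """Write the sample rows to path using the given text encoding."""
    with open(path, "w", newline="", encoding=encoding) as record:
        writer = csv.DictWriter(record, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


out_dir = tempfile.mkdtemp()

# Forcing cp1252 reproduces the error from the question on any platform.
try:
    write_rows(os.path.join(out_dir, "broken.csv"), "cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# utf-8-sig can encode every row and prepends a byte order mark (BOM).
good_path = os.path.join(out_dir, "wongnai.csv")
write_rows(good_path, "utf-8-sig")

with open(good_path, "rb") as raw:
    starts_with_bom = raw.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM again.
with open(good_path, newline="", encoding="utf-8-sig") as record:
    read_back = list(csv.DictReader(record))

print(cp1252_failed, starts_with_bom, read_back == rows)
```

Note that the `get_page` function from the question does not need to change: decoding the response as UTF-8 is fine, and only the `open()` call that writes the CSV needs the explicit `encoding` argument.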