Tags: python, html, python-3.x, unicode, fwrite

How to write both Chinese characters and English characters into a file (Python 3)?


I wrote a script to scrape the titles from a YouTube playlist page.

According to my print statements, everything works fine until I try to write the titles to a text file, at which point I get "UnicodeEncodeError: 'charmap' codec can't encode characters in position...".

I've tried adding "encoding='utf8'" when I open the file. That fixes the error, but all the Chinese characters show up as random gibberish.

I also tried encoding the output string with 'replace' and then decoding it, but that just replaces all the special characters with question marks.

Here is my code:

from bs4 import BeautifulSoup as BS
import urllib.request
import re

playlist_url = input("gib nem: ")

with urllib.request.urlopen(playlist_url) as response:
  playlist = response.read().decode('utf-8')
  soup = BS(playlist, "lxml")

title_attrs = soup.find_all(attrs={"data-title":re.compile(r".*")})
titles = [tag["data-title"] for tag in title_attrs]

titles_str = '\n'.join(titles)#.encode('cp1252','replace').decode('cp1252')

print(titles_str)
with open("playListNames.txt", "a") as f:
    f.write(titles_str)

And here is the sample playlist I've been using to test: https://www.youtube.com/playlist?list=PL3oW2tjiIxvSk0WKXaEiDY78KKbKghOOo


Solution

  • The documentation is clear about file encoding:

    encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
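As a minimal sketch of what the quoted passage implies, the snippet below writes mixed Chinese and English text with an explicit encoding="utf-8" and reads it back with the same encoding. The sample titles and the temporary file location are made up for illustration. The last step shows one possible source of "gibberish" (an assumption here, not something the question confirms): the bytes on disk are valid UTF-8, but they are being decoded as cp1252 by whatever program displays them.

```python
import os
import tempfile

# Hypothetical sample titles mixing Chinese and English text.
titles = ["周杰倫 Jay Chou - 稻香", "Hello World"]
titles_str = "\n".join(titles)

# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "playListNames.txt")

# An explicit encoding avoids depending on the platform default (often cp1252 on Windows).
with open(path, "a", encoding="utf-8") as f:
    f.write(titles_str)

# Reading back with the same encoding recovers the original text.
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Decoding the same UTF-8 bytes as cp1252 produces mojibake instead.
mojibake = titles_str.encode("utf-8").decode("cp1252", errors="replace")

print(round_trip == titles_str)
print(mojibake == titles_str)
```

If the file looks wrong in an editor but round-trips correctly like this, the editor's encoding setting is the more likely culprit than the script.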

    To answer the questions from your last comment:

    1. You can find the preferred encoding on Windows with:

      import locale
      # Print the default encoding that open() uses when no encoding is given
      print(locale.getpreferredencoding())
      

    If playListNames.txt was created with open('playListNames.txt', 'w'), then the value returned by locale.getpreferredencoding() was used as the file's encoding.

    If the file was created manually then the encoding depends on the editor's default/preferred encoding.

    2. Refer to How to convert a file to utf-8 in Python? or How do I convert an ANSI encoded file to UTF-8 with Notepad++? [closed].
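For the conversion mentioned in the second point, one possible approach in Python is to read the file with its current encoding and write it out again as UTF-8. This is a rough sketch, not the method from the linked answers: convert_to_utf8 is a hypothetical helper, the file names are placeholders, and cp1252 is only assumed as the source encoding (check it with locale.getpreferredencoding() or your editor).

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file from source_encoding to UTF-8 (hypothetical helper)."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo with a throwaway cp1252 ("ANSI") file in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "ansi.txt")
dst = os.path.join(workdir, "utf8.txt")

with open(src, "w", encoding="cp1252", newline="") as f:
    f.write("café\r\nnaïve")

convert_to_utf8(src, dst, "cp1252")

# Inspect the raw bytes of the converted file.
with open(dst, "rb") as f:
    converted = f.read()
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.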