I wrote a script to scrape the titles from a YouTube playlist page.
Everything works fine, according to print statements, until I try to write the titles to a text file, at which point I get: UnicodeEncodeError: 'charmap' codec can't encode characters in position...
I've tried adding encoding='utf8' when I open the file; while that fixes the error, all the Chinese characters come out as random gibberish.
I also tried encoding the output string with 'replace' and then decoding it, but that just replaces all the special characters with question marks.
Here is my code:
from bs4 import BeautifulSoup as BS
import urllib.request
import re

playlist_url = input("gib nem: ")

with urllib.request.urlopen(playlist_url) as response:
    playlist = response.read().decode('utf-8')

soup = BS(playlist, "lxml")
title_attrs = soup.find_all(attrs={"data-title": re.compile(r".*")})
titles = [tag["data-title"] for tag in title_attrs]
titles_str = '\n'.join(titles)#.encode('cp1252','replace').decode('cp1252')
print(titles_str)

with open("playListNames.txt", "a") as f:
    f.write(titles_str)
And here is the sample playlist I've been using to test: https://www.youtube.com/playlist?list=PL3oW2tjiIxvSk0WKXaEiDY78KKbKghOOo
The documentation is clear about file encoding:

    encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
To answer the questions from your last comment:

You can find out what the preferred encoding on Windows is with:

import locale
locale.getpreferredencoding()
If playListNames.txt was created with open('playListNames.txt', 'w'), then the value returned by locale.getpreferredencoding() was used for encoding.
If the file was created manually then the encoding depends on the editor's default/preferred encoding.
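That also explains the "gibberish" you saw after adding encoding='utf8': the bytes in the file are valid UTF-8, but your editor (or any later open() without an encoding argument) decoded them as cp1252, producing mojibake. A minimal sketch of the effect, using one Chinese character as an example:

```python
# UTF-8 bytes of a Chinese character, misread as cp1252 (mojibake).
text = "\u4e2d"             # the character '中'
raw = text.encode("utf-8")  # b'\xe4\xb8\xad'

# Decoding those three UTF-8 bytes as cp1252 yields three unrelated
# Latin characters instead of the original '中'.
wrong = raw.decode("cp1252")
print(wrong)

# Decoding with the correct codec recovers the original character.
assert raw.decode("utf-8") == text
```

So your write was likely fine; the file just has to be opened in an editor set to UTF-8 (most editors let you pick the encoding when opening or via a status-bar setting).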