I was trying to scrape an entire page and save it to a file in two different ways. I expected both to work fine. This is the code that doesn't work:

import aiohttp
import asyncio

url = "https://unsplash.com/s/photos/dogs"

async def main():
    async with aiohttp.ClientSession() as s:
        async with s.get(url) as r:
            enc = str(r.get_encoding())
            bytes = await r.read()  # returns <class 'bytes'>
            with open("stuff.html", "w") as f:
                # For errors I've tried all possible accepted values.
                f.write(bytes.decode(encoding=enc, errors="ignore"))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

This results in a UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 58100: character maps to <undefined>. I'm assuming this means that the character at the stated position can't, for one reason or another, be decoded and converted into a string.
After modifying the main function to the following, it works fine:

async def main():
    async with aiohttp.ClientSession() as s:
        async with s.get(url) as r:
            enc = str(r.get_encoding())
            bytes = await r.read()
            with open("stuff.html", "wb") as f:
                f.write(bytes)

I'm not sure why the first version won't work. In the second code block, I'm just writing the bytes to a file called stuff.html with the context manager. In the first code block, I'm taking a longer way of doing the same thing: using the decode() method to turn the bytes into a string before writing it to the file, so that I don't need to open the file with wb, w, etc.
f.write(string) encodes the string to bytes before the actual write, using the system default encoding if no explicit encoding was set in the open() call.
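To make the encoding step visible, here is a minimal sketch that writes the same text in text mode with two explicitly chosen encodings. The helper name try_text_write and the sample text are made up for illustration, and cp1252 is only assumed to stand in for a typical Windows default:

```python
import os
import tempfile


def try_text_write(text, encoding):
    """Write text in text mode with the given encoding.

    Returns the error message if the text cannot be encoded, else None.
    (Hypothetical helper, for illustration only.)
    """
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)  # the str is encoded to bytes here
        return None
    except UnicodeEncodeError as exc:
        return str(exc)
    finally:
        os.remove(path)  # clean up the temporary file


if __name__ == "__main__":
    sample = "done \u2713"
    print("cp1252:", try_text_write(sample, "cp1252"))
    print("utf-8: ", try_text_write(sample, "utf-8"))
```

The decode() call in the question succeeds in both cases; only the write to a file opened with an encoding that lacks the character should fail.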
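On Windows, the default encoding for text files is not utf-8 but the locale's ANSI code page (see locale.getpreferredencoding()), for example cp1252. Python implements these code pages as charmap codecs, which is why the error message names 'charmap'. Such a code page cannot encode every Unicode character ('\u2713' here), and that's why you see the error. Note that the failure happens when the string is encoded on write, not when the bytes are decoded. Passing an explicit encoding, e.g. open("stuff.html", "w", encoding="utf-8"), avoids it.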
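A minimal sketch of the usual workaround, assuming UTF-8 is the encoding you want for the saved page. The helper name save_text and the demo file name are hypothetical:

```python
import locale
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Write text to path with an explicit encoding instead of the platform default.

    (Hypothetical helper, for illustration only.)
    """
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


if __name__ == "__main__":
    # What open() would fall back to without an explicit encoding.
    print("default text encoding:", locale.getpreferredencoding(False))

    path = os.path.join(tempfile.gettempdir(), "stuff_demo.html")
    save_text(path, "<p>done \u2713</p>")
    with open(path, "rb") as f:
        print(f.read())  # the raw UTF-8 bytes that ended up on disk
```

In the question's code, the equivalent change would be to open the file as open("stuff.html", "w", encoding="utf-8") before writing the decoded string.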
There is a discussion about switching the default encoding on Windows to utf-8, but the switch would introduce backward-compatibility problems and thus has not been made yet.
The current state of file encoding is described in the Python docs for Windows.