Tags: python, python-3.x, encoding, aiohttp, python-3.9

Unusual behavior with the built-in decode method (aiohttp is used as well)


I was trying to scrape an entire page and save it to a file in two different ways, and I expected both to work fine. This is the code that doesn't work:

import aiohttp
import asyncio

url = "https://unsplash.com/s/photos/dogs"

async def main():
    async with aiohttp.ClientSession() as s:
        async with s.get(url) as r:
            enc = str(r.get_encoding())
            bytes = await r.read()  # returns <class 'bytes'>
            with open("stuff.html", "w") as f:
                # For errors I've tried all possible accepted values.
                f.write(bytes.decode(encoding=enc, errors="ignore"))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

This results in a UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 58100: character maps to <undefined>. I'm assuming there is a character at the stated position that, for one reason or another, can't be decoded and converted into a string. Modifying the main function to the following makes it work fine:

async def main():
    async with aiohttp.ClientSession() as s:
        async with s.get(url) as r:
            enc = str(r.get_encoding())
            bytes = await r.read()
        with open("stuff.html", "wb") as f:
            f.write(bytes)

I'm not sure why the first version won't work. In the second code block, I'm just writing the bytes to a file called stuff.html with the context manager. In the first code block, I'm taking a longer way of doing the same thing: using the decode() method to turn the bytes into a string and then writing that string to the file. So it shouldn't matter whether I open the file with wb or w.


Solution

  • f.write(string) encodes the string to bytes before the actual writing. If no explicit encoding was passed to the open() call, it uses the system default encoding.

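    On Windows, that default (see locale.getpreferredencoding()) is not UTF-8 but a legacy ANSI code page such as cp1252. Python implements these code pages with its charmap codec, which is why the error message names charmap. A code page cannot encode every Unicode character, and '\u2713' (a check mark) is one it lacks; that's why you see the error. Note that it is a UnicodeEncodeError: decode() succeeded, so its errors argument has no effect, and the failure happens when write() encodes the string again. Passing encoding="utf-8" to open() avoids it.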

    There is a discussion about switching the Windows default encoding to UTF-8, but the switch raises backward-compatibility problems and so has not been made yet.

    The current file encoding state is described in Python Docs for Windows.
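
The explanation above can be checked without any network access. The following is a minimal sketch, not the asker's exact program: the sample markup, the variable names, and the choice of cp1252 as a stand-in for a typical Windows code page are all assumptions made for illustration.

```python
import locale
import os
import tempfile

# Hypothetical payload: UTF-8 bytes containing U+2713 (check mark),
# standing in for what `await r.read()` returned in the question.
raw = "<p>done \u2713</p>".encode("utf-8")

# Decoding works; this is not where the question's error comes from.
text = raw.decode("utf-8")

# Assumption: cp1252 represents a typical Windows default code page.
# It has no mapping for U+2713, so encoding the string fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print("default text encoding:", locale.getpreferredencoding(False))
print("cp1252 can encode the text:", cp1252_ok)

# Passing encoding explicitly makes open() independent of the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stuff.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # Read the file back in binary mode to compare with the original bytes.
    with open(path, "rb") as f:
        written = f.read()

print("round trip matches original bytes:", written == raw)
```

Applied to the question's first code block, the corresponding change would be open("stuff.html", "w", encoding="utf-8"), or encoding=enc to write the text back in the encoding the server reported.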