python web-scraping beautifulsoup html-parsing

Python 'latin-1' codec can't encode character - How to ignore characters?

Here is the gist of my code. It's trying to get some text from an old website. It's not mine, so I can't change the source.

from bs4 import BeautifulSoup
import requests

response = requests.get("https://mattgemmell.com/network-link-conditioner-in-lion/")
data = response.text
soup = BeautifulSoup(data, 'lxml')
article = soup.find_all('article')[0]
text = article.find_all('p')[1].text 
print(text)

Gives this:

'If youâ\x80\x99re a developer of either Mac or iOS apps that use networking, thereâ\x80\x99s a new feature in the Developer Tools for Mac OS X 10.7 â\x80\x9cLionâ\x80\x9d (read my review of it at The Guardian) which will be useful to you. This brief article describes how it works.'

I can repair sequences like â\x80\x99 with this:

converted_text = bytes(text, 'latin-1').decode('utf-8')

That actually works.

But if you get a different part of the text:

text = article.find_all('p')[8].text

Gives me:

'\n← Find Patterns in text on Lion\nUsing Spaces on OS X Lion →\n'

And using bytes(text, 'latin-1') gives me:

'latin-1' codec can't encode character '\u2190' in position 1: ordinal not in range(256)
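
For context: latin-1 only covers code points 0 to 255. The garbled text from the first paragraph stays inside that range, so it can be turned back into bytes, but a correctly decoded arrow (U+2190) cannot. A minimal sketch that reproduces both cases offline, using made-up sample strings rather than the real page:

```python
# Hypothetical sample strings; no network access is needed.
good = "you’re"  # what the page author wrote

# UTF-8 bytes misread as latin-1 produce the familiar garbled sequence.
garbled = good.encode("utf-8").decode("latin-1")

# Every character in the garbled text is below U+0100, so latin-1 can encode
# it back to the original bytes, and UTF-8 can then decode those bytes properly.
repaired = garbled.encode("latin-1").decode("utf-8")

# An arrow that was already decoded correctly lies outside latin-1's range.
arrows = "← Find Patterns"
try:
    arrows.encode("latin-1")
    failed = False
except UnicodeEncodeError:
    failed = True

print(repr(garbled), repr(repaired), failed)
```

So the round trip is only valid for text that really was mis-decoded; applying it to text that is already correct raises the error above.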

I assume it's the arrows? How can I make it so that all non-latin-1 characters are automatically ignored and discarded?

Any ideas would be most helpful!

Solution

You don't want to ignore these characters. They are a symptom of the data having been decoded with the wrong character encoding. In your case requests decoded the response as latin-1, but the real encoding is utf-8, which is declared in a <meta> tag in the HTML.

requests is a library for working with HTTP; it doesn't know about HTML. Since the Content-Type header doesn't specify an encoding, requests fell back to latin-1 (ISO-8859-1), its default for text responses that carry no charset. BeautifulSoup, however, is a library for working with HTML, and it is very good at detecting encodings. So get the raw bytes from the response and pass them to BeautifulSoup, i.e.:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://mattgemmell.com/network-link-conditioner-in-lion/")
data = response.content # we now get `content` rather than `text`
assert type(data) is bytes
soup = BeautifulSoup(data, 'lxml')
article = soup.find_all('article')[0]
text = article.find_all('p')[1].text 
print(text)

assert type(text) is str
assert 'Mac OS X 10.7 “Lion”' in text
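
To illustrate why the header and the <meta> tag can disagree, here is a rough standard-library sketch. It is not how requests or BeautifulSoup are actually implemented; the helper names `charset_from_header` and `charset_from_meta`, the sample header, and the sample HTML are all hypothetical, and the regex is far cruder than a real encoding detector.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(raw_html):
    """Crudely look for a charset declaration near the start of the page."""
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw_html[:1024], re.I)
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical response: the header has no charset, the body declares utf-8.
content_type = "text/html"
raw = (
    b'<html><head><meta charset="utf-8"></head>'
    b"<body><p>you\xe2\x80\x99re</p></body></html>"
)

header_charset = charset_from_header(content_type)  # nothing declared here
meta_charset = charset_from_meta(raw)               # declared inside the HTML

wrong = raw.decode("latin-1")     # what a latin-1 fallback produces
right = raw.decode(meta_charset)  # what the page actually says

print(header_charset, meta_charset)
print(repr(wrong))
print(repr(right))
```

The same bytes give garbled or correct text depending only on which declaration is trusted, which is why handing the undecoded bytes to an HTML-aware parser sidesteps the problem.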