Here is the gist of my code. It's trying to get some text from an old website. It's not mine, so I can't change the source.
from bs4 import BeautifulSoup
import requests
response = requests.get("https://mattgemmell.com/network-link-conditioner-in-lion/")
data = response.text
soup = BeautifulSoup(data, 'lxml')
article = soup.find_all('article')[0]
text = article.find_all('p')[1].text
print(text)
Gives this:
'If youâ\x80\x99re a developer of either Mac or iOS apps that use networking, thereâ\x80\x99s a new feature in the Developer Tools for Mac OS X 10.7 â\x80\x9cLionâ\x80\x9d (read my review of it at The Guardian) which will be useful to you. This brief article describes how it works.'
I can use this to convert parts like â\x80\x99:
converted_text = bytes(text, 'latin-1').decode('utf-8')
This actually works.
But if you get a different part of the text:
text = article.find_all('p')[8].text
Gives me:
'\n← Find Patterns in text on Lion\nUsing Spaces on OS X Lion →\n'
And using bytes(text, 'latin-1')
gives me:
'latin-1' codec can't encode character '\u2190' in position 1: ordinal not in range(256)
I assume it's the arrows? How can I make it so that all non-Latin characters are automatically ignored and discarded?
Any ideas would be most helpful!
You don't want to ignore these characters. They are a symptom that the data you received has been decoded using the wrong character encoding. In your case, requests has incorrectly guessed that the encoding is latin-1. The real encoding is utf-8, and it is specified in a <meta> tag in the HTML response. requests is a library for working with HTTP; it doesn't know about HTML. Since the Content-Type header doesn't specify the encoding, requests resorted to guessing it. BeautifulSoup, however, is a library for working with HTML, and it is very good at detecting encodings. As such, you want to get the raw bytes from the response and pass them to BeautifulSoup, i.e.:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://mattgemmell.com/network-link-conditioner-in-lion/")
data = response.content # we now get `content` rather than `text`
assert type(data) is bytes
soup = BeautifulSoup(data, 'lxml')
article = soup.find_all('article')[0]
text = article.find_all('p')[1].text
print(text)
assert type(text) is str
assert 'Mac OS X 10.7 “Lion”' in text
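To see why the latin-1 round-trip worked on one paragraph and failed on the other, here is a small offline sketch using only the standard library. It reproduces the garbling by decoding UTF-8 bytes as latin-1, then reverses it. The helper name repair_mojibake is hypothetical (it is not part of requests or bs4), and it assumes that any string which cannot be round-tripped was already decoded correctly. Treat it as a fallback illustration, not a replacement for passing response.content to BeautifulSoup.

```python
# Reproduce the garbling: UTF-8 bytes read with the wrong codec.
original = "you’re"
raw = original.encode("utf-8")       # the bytes the server actually sent
garbled = raw.decode("latin-1")      # decoded with the wrong encoding
assert garbled == "youâ\x80\x99re"   # matches the text in the question


def repair_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8-read-as-latin-1 damage if present."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside latin-1 (such as the
        # arrows), or its latin-1 bytes are not valid UTF-8. Assumption:
        # in both cases the text was not garbled, so return it unchanged.
        return text


repaired = repair_mojibake(garbled)
arrows = repair_mojibake("← Find Patterns in text on Lion")
```

Here repaired comes back as the original string, while the arrow text is returned untouched instead of raising. If you would rather keep using response.text, another option is to set response.encoding = 'utf-8' before reading it, assuming you know the page really is UTF-8.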