I am trying to get all paragraphs from a couple of websites using a for loop, but I am getting an empty dataframe. The logic of the program is:
urls = []
texts = []
for r in my_list:
    try:
        # Get text
        url = urllib.urlopen(r)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts": texts})
An example of urls in my_list might be: https://www.ford.com.au/performance/mustang/ , https://soperth.com.au/perths-best-fish-and-chips-46154, https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html, https://www.bbc.co.uk/programmes/b07d2wy4
How can I correctly store the links and the text on that specific page (so not the whole website!)?
Expected output:
Urls Texts
https://www.ford.com.au/performance/mustang/ Nothing else offers the unique combination of classic style and exhilarating performance quite like the Ford Mustang. Whether it’s the Fastback or Convertible, 5.0L V8 or High Performance 2.3L, the Mustang has a heritage few other cars can match.
https://soperth.com.au/perths-best-fish-and-chips-46154
https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html
https://www.bbc.co.uk/programmes/b07d2wy4
where in Texts I should have, for each url, the paragraphs contained in that page (i.e., all <p> elements).
Even dummy code (so not exactly mine) would help me understand where my error is. I guess my current error might be at this step: url = urllib.urlopen(r), as I get no text.
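For what it's worth, a minimal check (assuming the script runs under Python 3, where urlopen moved into urllib.request) fails at exactly that call:

import urllib

# Python 2's urllib.urlopen no longer exists in Python 3; this raises
# AttributeError: module 'urllib' has no attribute 'urlopen'.
# The except block just prints it and continues, so both lists stay
# empty and the final dataframe is empty too.
urllib.urlopen("https://example.com")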
I tried the following code (Python 3, hence urllib.request) and it works. I added a user agent, as urlopen was hanging otherwise.
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

urls = []
texts = []
my_list = ["https://www.ford.com.au/performance/mustang/",
           "https://soperth.com.au/perths-best-fish-and-chips-46154",
           "https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html",
           "https://www.bbc.co.uk/programmes/b07d2wy4"]

for r in my_list:
    try:
        # Build the request with a browser-like user agent; some sites
        # stall or block the default urllib user agent.
        req = urllib.request.Request(
            r,
            data=None,
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        url = urllib.request.urlopen(req)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Collect the text of every <p> tag, not just the first one
        page = ''
        for para in soup.find_all('p'):
            page += para.get_text()
        print(page)
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df)
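The same logic also works with the third-party requests library, which handles headers a little more cleanly. This is just a minimal sketch of that alternative (reusing my_list from above), joining each page's paragraphs with newlines so they stay readable in the dataframe:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
rows = []
for r in my_list:
    try:
        resp = requests.get(r, headers=headers, timeout=10)
        resp.raise_for_status()  # surface HTTP errors (403, 404, ...) instead of parsing an error page
        soup = BeautifulSoup(resp.text, 'lxml')
        # find_all('p') returns every paragraph; find('p') would return only the first
        page = '\n'.join(p.get_text(strip=True) for p in soup.find_all('p'))
        rows.append({"Urls": r, "Texts": page})
    except requests.RequestException as e:
        print(r, e)

df = pd.DataFrame(rows)
print(df)

Catching requests.RequestException rather than a bare Exception keeps genuine parsing bugs visible while still skipping urls that fail to load.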