python · web-scraping · beautifulsoup · urllib

Getting text from URLs returns an empty DataFrame


I am trying to get all paragraphs from a couple of websites using a for loop, but I am getting an empty dataframe. The logic of the program is:

urls = []
texts = []

for r in my_list:
    try:
        # Get text
        url = urllib.urlopen(r)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts:": texts})

An example of urls (my_list) might be for example: https://www.ford.com.au/performance/mustang/ , https://soperth.com.au/perths-best-fish-and-chips-46154, https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html, https://www.bbc.co.uk/programmes/b07d2wy4

How can I correctly store the links and the text of that specific page (so not the whole website!)?

Expected output:

Urls                                                       Texts

https://www.ford.com.au/performance/mustang/         Nothing else offers the unique combination of classic style and exhilarating performance quite like the Ford Mustang. Whether it’s the Fastback or Convertible, 5.0L V8 or High Performance 2.3L, the Mustang has a heritage few other cars can match.
https://soperth.com.au/perths-best-fish-and-chips-46154 
https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html 
https://www.bbc.co.uk/programmes/b07d2wy4 

where in Texts I should have, for each URL, the paragraphs contained in that page (i.e., all `<p>` elements). Even dummy code (so not exactly mine) would be helpful to understand where my error is. I guess my current error might be at this step: `url = urllib.urlopen(r)`, as I have no text.
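That guess is plausible if the code runs under Python 3, where `urlopen` moved to `urllib.request`; the Python 2 call `urllib.urlopen(...)` then raises an `AttributeError`, which the `except Exception` branch silently swallows for every URL, leaving both lists (and hence the DataFrame) empty. A minimal sketch of the failure, assuming Python 3:

```python
import urllib

# The Python 2 call urllib.urlopen(...) no longer exists in Python 3:
try:
    urllib.urlopen("https://www.ford.com.au/performance/mustang/")
except AttributeError as err:
    # AttributeError: Python 3 moved urlopen to urllib.request.urlopen
    print(err)
```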


Solution

  • I tried the following code (Python 3, hence `urllib.request`) and it works. I added a user agent, as `urlopen` was hanging without one.

    import pandas as pd
    import urllib.request
    from bs4 import BeautifulSoup
    
    urls = []
    texts = []
    my_list = ["https://www.ford.com.au/performance/mustang/", "https://soperth.com.au/perths-best-fish-and-chips-46154",
               "https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html", "https://www.bbc.co.uk/programmes/b07d2wy4"]
    
    for r in my_list:
        try:
            # Get text
            req = urllib.request.Request(
                r,
                data=None,
                headers={
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
                }
            )
            url = urllib.request.urlopen(req)
            content = url.read()
            soup = BeautifulSoup(content, 'lxml')
    
            # Find all of the text between paragraph tags and strip out the html
            page = ''
            for para in soup.find_all('p'):
                page += para.get_text()
            print(page)
            texts.append(page)
            urls.append(r)
        except Exception as e:
            print(e)
            continue
    
    df = pd.DataFrame({"Urls": urls, "Texts": texts})
    print(df)
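Besides the Python 3 `urllib.request` fix, note the other change from the original attempt: `find('p')` returns only the first paragraph, while `find_all('p')` returns every one. A minimal offline sketch of the difference (using the stdlib `html.parser` backend and made-up HTML, so no network or `lxml` install is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph page, just to illustrate find vs find_all
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p").get_text()                              # only the first <p>
all_text = " ".join(p.get_text() for p in soup.find_all("p"))  # every <p>

print(first)     # First paragraph.
print(all_text)  # First paragraph. Second paragraph.
```

In the solution above the paragraphs are concatenated with `+=`, so you may want to insert a space or newline between them as done here.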