Tags: python, beautifulsoup, python-requests, google-sheets-api, gspread

How to loop through URLs hosted in a Google Sheet


It's been a step-by-step process getting the code to this point; the goal was to visit a list of URLs and scrape specific data from each. That has been accomplished with the script below:

import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']


for url in urls:

    r = requests.get(url, headers=headers)

    soup = bs(r.text, 'html.parser')

    # The story data is embedded in the page as Next.js JSON (__NEXT_DATA__)
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)

    print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])

I had an idea I hoped to implement, and I feel I'm very close, but I'm not sure why it isn't running. Basically, rather than having a static list of URLs in the script, I want to use a Google Sheet as the source of URLs. Meaning, a column on one tab will hold the list of URLs that need to be scraped.

From there, when run, the script will pull the URLs from the first tab, scrape the data, and then push the results to the second tab.

I've been able to print the URLs in the terminal with this approach - basically, getting to the source and requesting all records.
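
For reference, that check looks roughly like this (same creds.json, sheet name, and tab index as in the script further down):

import gspread

gc = gspread.service_account(filename='creds.json')
freshurls = gc.open("NBA Stories").get_worksheet(1)

# get_all_records() returns one dict per row, keyed by the header row
for record in freshurls.get_all_records():
    print(record)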

I thought I'd then be able to loop through those links in the same way (new code):

from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = freshurls.get_all_records()


for url in urls:
    try:
        r = requests.get(url, headers=headers)

        soup = bs(r.text, 'html.parser')

        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)

        title = (json_obj['props']['pageProps']['story']['header']['headline'])
        date = (json_obj['props']['pageProps']['story']['date'])
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)

    except:
        skip

Even if I switch the ending actions (AddData and append_row) to just printing the results, I'm not seeing anything.

Seems like I'm missing a step? Is there something I could do differently here to leverage those URLs right from the sheet, instead of having to paste them in the script every time?


Solution

  • SUGGESTION

    You can try using the batch_get method in a separate script file to get the URL data from the sheet tab, and then import that URL data into the loop in your scraping script file. This reduces complexity and keeps the script readable. For more context, see the sample scripts below.

    In my understanding, here is your goal:

    • Put a list of URLs on a specific sheet tab in a spreadsheet file.
    • Get the URL data from that sheet tab in Python.
    • Loop through it in your Python script and scrape the data per URL.
    • Append each scraped result to a second sheet tab.

    Sample Script

    The getURLsFromSpreadsheet.py file

    import gspread
    
    gc = gspread.service_account(filename='creds.json')
    
    # Open a spreadsheet by ID
    sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
    
    # Get the sheets
    wk = sh.worksheet("Sheet1")
    appendWk = sh.worksheet("Sheet2")
    
    # E.g. the URLs are listed on Sheet1 in Column A
    urls = wk.batch_get(('A2:A',))[0]
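
    For clarity, batch_get returns the requested range as a list of rows, and each row is itself a list of cell values, which is why the scraping script below reads each link with url[0]. A quick sanity check would look something like this (the commented output uses the two URLs from the question as sample values):

    # urls is a list of single-cell rows, e.g.:
    # [['https://www.nba.com/game/bkn-vs-phi-0022100993'],
    #  ['https://www.nba.com/game/was-vs-lac-0022100992']]
    for url in urls:
        print(url[0])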
    

    The scrapeScript.py file

    from getURLsFromSpreadsheet import *
    import requests
    from bs4 import BeautifulSoup as bs
    import json
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
    }
    
    
    for url in urls:
    
        r = requests.get(url[0], headers=headers)
    
        soup = bs(r.text, 'html.parser')
    
        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)
        samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                       str(json_obj['props']['pageProps']['story']['date']),
                       str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                     ]]
        appendWk.append_rows(samplelist)
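
    If you also want the source URL in each appended row, matching the AddData list ([url, title, date, content]) from the question, the same append inside the loop could be written like this:

    # Same append, but with the source URL as the first column
    samplelist = [[url[0],
                   str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                 ]]
    appendWk.append_rows(samplelist)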
    

    Demonstration

    Sample spreadsheet file. The URLs are listed in Column A.


    The Sheet 2 tab after running the scrapeScript.py file


