Tags: python, google-sheets, beautifulsoup, google-sheets-api, gspread

If a duplicate URL entry exists, don't append the data (BeautifulSoup to Google Sheets)


I was wondering if you can help.

I'm using BeautifulSoup to scrape content and gspread to write it to Google Sheets.

I've created a crawler that runs through a series of URLs, scrapes the content, and then updates a Google Sheet.

What I now want to do is check whether a URL already exists in column C, and if it does, prevent it from being written to my sheet again.

E.g. if I already had the URL https://www.bbc.co.uk/1 in my table, I wouldn't want it appearing there again.
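
For example (hypothetical rows, with column C holding the URL):

    A (date)      B (title)     C (url)
    2022-01-01    Article 1     https://www.bbc.co.uk/1
    2022-01-02    Article 1     https://www.bbc.co.uk/1   <- this second row should be skipped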

Here is my code:

import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.find_all('h3', class_="article-category-section-title"):
        date = datetime.datetime.now()
        link = item.find('a', class_='primary js_target')
        title = link.text.strip()
        href = link.get('href')

        # Resolve the relative article link against the site root.
        base = "https://www.ig.com"

        info = {'date': date, 'title': title, 'url': urllib.parse.urljoin(base, href)}
        sh.append_row([str(info['date']), str(info['title']), str(info['url'])])

Thanks in advance.

Mark

I'd like to know what I can add to the end of my code to prevent duplicate URLs from being entered into my Google Sheet.


Solution

  • I believe your goal is as follows.

    • You want to append the row [str(info['date']), str(info['title']), str(info['url'])] only when the value of str(info['url']) does not already exist in column "C".

    Modification points:

    • In this case, you need to check column "C" of the existing sheet of sh = gc.open('scrapetosheets').sheet1 before appending. This has already been mentioned in TheMaster's comment. (A minimal sketch of that check follows this list.)

    • In your script, append_row is used in a loop. Each call to append_row is a separate Sheets API request, so the process cost becomes high; it is cheaper to collect the new rows and append them all at once with append_rows.
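
    On its own, that duplicate check is only a few lines. A minimal sketch, assuming the same sh worksheet object (col_values(3) returns every value in column "C"; the new_url value is just an example):

    import gspread

    gc = gspread.service_account(filename='creds.json')
    sh = gc.open('scrapetosheets').sheet1

    # Read column "C" once; a set gives fast membership tests.
    existing_urls = set(sh.col_values(3))

    new_url = "https://www.bbc.co.uk/1"  # example value from the question
    if new_url not in existing_urls:
        sh.append_row(["2022-01-01", "Some title", new_url])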

    When these points are reflected in your script, how about the following modification?

    Modified script:

    import requests
    from bs4 import BeautifulSoup
    import gspread
    import datetime
    import urllib.parse

    gc = gspread.service_account(filename='creds.json')
    sh = gc.open('scrapetosheets').sheet1

    urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

    # I modified the below script.
    # Build a lookup of every URL already in column "C" (the 3rd column).
    obj = {r[2]: True for r in sh.get_all_values() if len(r) > 2}
    ar = []

    for url in urls:
        my_url = requests.get(url)
        html = my_url.content
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all("h3", class_="article-category-section-title"):
            date = datetime.datetime.now()
            link = item.find("a", class_="primary js_target")
            title = link.text.strip()
            href = link.get("href")
            base = "https://www.ig.com"
            info = {"date": date, "title": title, "url": urllib.parse.urljoin(base, href)}
            url_value = str(info["url"])
            if url_value not in obj:
                ar.append([str(info["date"]), str(info["title"]), url_value])
                obj[url_value] = True  # also skip duplicates scraped within this run

    if ar:
        # One batch request instead of one request per appended row.
        sh.append_rows(ar, value_input_option="USER_ENTERED")
    
    • When this script is run, the existing values are first retrieved from the sheet and turned into an object (obj) for looking up each str(info["url"]). When a scraped URL does not already exist in column "C", its row is collected into an array, and the array is then appended to the sheet with a single append_rows request. value_input_option="USER_ENTERED" makes Sheets parse the written values as if they had been typed in, so date-like strings are stored as dates rather than as plain text.
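
    The filtering step itself is plain Python, independent of gspread. Here is the same pattern with made-up URLs, for illustration only:

    existing = {"https://www.ig.com/a": True, "https://www.ig.com/b": True}  # already in the sheet
    scraped = ["https://www.ig.com/b", "https://www.ig.com/c"]  # just crawled

    new_rows = [[u] for u in scraped if u not in existing]
    print(new_rows)  # [['https://www.ig.com/c']] -- only the unseen URL is kept

    A plain set of URLs would work just as well as the dict of True values; the dict form simply mirrors obj in the modified script above.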

    Reference:

    • append_rows of gspread