I was wondering if you can help.
I'm using BeautifulSoup and gspread to write scraped content to a Google Sheet.
I've created a crawler that runs through a series of URLs, scrapes the content, and then updates a Google Sheet.
What I now want to do is prevent a URL from being written to my sheet again if it already exists in column C.
e.g. if I already had the URL https://www.bbc.co.uk/1 in my table, I wouldn't want it appearing in my table again.
Here is my code:
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.find_all('h3', class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find('a', class_='primary js_target').text.strip()
        url = item.find('a', class_='primary js_target').get('href')
        abs = "https://www.ig.com"
        rel = url
        info = {'date': date, 'title': title, 'url': urllib.parse.urljoin(abs, rel)}
        sh.append_row([str(info['date']), str(info['title']), str(info['url'])])
I'd like to know what I can add to the end of my code to prevent duplicate URLs being entered into my Google Sheet.
Thanks in advance.
Mark
I believe your goal is as follows.

You want to append the row of [str(info['date']), str(info['title']), str(info['url'])] only when the value of str(info['url']) does not already exist in column "C". In this case, it is required to check column "C" of the existing sheet of sh = gc.open('scrapetosheets').sheet1. This has already been mentioned in TheMaster's comment.
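As a minimal sketch of that check with gspread (col_values takes a 1-based column index, so 3 is column "C"):

existing_urls = set(sh.col_values(3))  # all values currently in column "C"

Any candidate URL can then be tested with "url in existing_urls" before it is written to the sheet.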
When I saw your script, append_row is used in a loop. Each call to append_row is a separate request to the Sheets API, so the process cost of the loop becomes high.
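To illustrate the difference, here is a small sketch (the rows variable and its values are hypothetical placeholders):

rows = [
    ["2022-01-01", "Title A", "https://www.ig.com/a"],
    ["2022-01-02", "Title B", "https://www.ig.com/b"],
]

# One API request per row:
for row in rows:
    sh.append_row(row)

# A single API request for all rows:
sh.append_rows(rows)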
When these points are reflected in your script, how about the following modification?
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

# I modified the script below.
# URLs already present in column "C" (skip rows shorter than 3 columns).
obj = {r[2]: True for r in sh.get_all_values() if len(r) > 2}
ar = []
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, "html.parser")

    for item in soup.find_all("h3", class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find("a", class_="primary js_target").text.strip()
        url = item.find("a", class_="primary js_target").get("href")
        abs = "https://www.ig.com"
        rel = url
        info = {"date": date, "title": title, "url": urllib.parse.urljoin(abs, rel)}
        url = str(info["url"])
        if url not in obj:
            ar.append([str(info["date"]), str(info["title"]), url])
            obj[url] = True  # also guard against duplicates found within this run

if ar:
    sh.append_rows(ar, value_input_option="USER_ENTERED")
In this modification, the existing values of the sheet are retrieved once with get_all_values(), and str(info["url"]) is checked against column "C". When the value of str(info["url"]) does not exist in column "C" of the sheet, the row values are put into an array. The array is then appended to the sheet with a single call to append_rows.
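One note on value_input_option="USER_ENTERED": it tells the Sheets API to parse the appended values as if a user had typed them into the sheet (so the date strings become date cells), while "RAW" would store them as plain strings.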