Tags: python, web-scraping, beautifulsoup, google-search

Python: search Google for websites that end with a specific word


I am trying to find all websites on Google that end with "gencat.cat".

My code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'gencat.cat'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.a['href']  # or result.select_one('.yuRUbf a')['href']
    print(link)

The problem is that only a few websites are returned, and some of the URLs either don't contain "gencat.cat" or are repeated pages from the same site.

Output that I have:

https://web.gencat.cat/ca/inici
https://web.gencat.cat/es/inici/
https://web.gencat.cat/ca/tramits
https://web.gencat.cat/en/inici/index.html
https://govern.cat/
https://govern.cat/salapremsa/
http://www.gencat.es/
http://www.regencos.cat/promocio-variable/preguntes-mes-frequents-sobre-el-coronavirus/
https://tauler.seu.cat/inici.do?idens=1

Output that I want:

https://web.gencat.cat
http://agricultura.gencat.cat
http://cultura.gencat.cat
https://dretssocials.gencat.cat
http://economia.gencat.cat

Solution

  • If you only want the scheme and domain (rather than the full URL path), you can split the link variable on "/" and take the host part, as in the snippet below.

    for result in soup.select('.tF2Cxc'):
        link = result.a['href']  # or result.select_one('.yuRUbf a')['href']
        print(link)

        # element [2] of the split is the host, e.g. web.gencat.cat
        string_splt = link.split("/")
        TLD = f"https://{string_splt[2]}"

        print(TLD)
    

    I am sure there is a cleaner way to put this together, but it seems to work. You will also need to handle the duplicates; one possible approach is sketched below.
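
One way the deduplication and filtering might look: a minimal sketch that reuses the soup object from the question, uses urllib.parse.urlsplit from the standard library to split each URL, and keeps only hosts ending in "gencat.cat" that have not been printed yet.

    from urllib.parse import urlsplit

    seen = set()
    for result in soup.select('.tF2Cxc'):
        link = result.a['href']
        parts = urlsplit(link)                       # scheme, host (netloc), path, ...
        base = f"{parts.scheme}://{parts.netloc}"    # e.g. https://web.gencat.cat

        # keep only gencat.cat hosts and skip bases already printed
        if parts.netloc.endswith("gencat.cat") and base not in seen:
            seen.add(base)
            print(base)

With the example output from the question, this would print each gencat.cat site once and drop results such as govern.cat, www.gencat.es, www.regencos.cat and tauler.seu.cat.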