Here's my input

```
app
fix
jd_id
zalora
leomaster
```
Here's my expected output
```
app        header                                                        url
fix        Fix.com | Your Source for Genuine Parts & DIY Repair Help     https://www.fix.com/
jd_id      jdid                                                          https://www.jd.id/
zalora     ZALORA Indonesia: Belanja Online Fashion & Lifestyle Terbaru  https://www.zalora.co.id/
leomaster  Leomaster — Manufacturers of fine fabrics                     https://www.leomaster.it/en/
```
It can be done manually with Google Chrome and an exhausting copy-paste process, but since I have 22,000+ apps that need to be checked, we need a scalable solution.
Doing this with Google would require a Google Search API account, so my solution uses DuckDuckGo instead; the approach is essentially the same with Google:
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_info(app_name):
    # Search DuckDuckGo's HTML endpoint for "<app_name> website"
    query = f"{app_name} website"
    url = "https://duckduckgo.com/html/"
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/58.0.3029.110 Safari/537.36"
        )
    }
    # Passing the query via params lets requests URL-encode it properly
    response = requests.get(url, params={"q": query}, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")

    # Each search hit is a <div class="result">; take the first https link
    for result in soup.find_all("div", class_="result"):
        link = result.find("a")
        if link is not None:
            header = link.get_text()
            href = link.get("href")
            if href.startswith("https://"):
                return {"app": app_name, "header": header, "url": href}
    return None  # no usable result found

app_list = ["fix", "jd_id", "zalora", "leomaster"]
results = [extract_info(app) for app in app_list]
results = [r for r in results if r is not None]  # drop failed lookups

df = pd.DataFrame(results)
print(df)
```
which returns:

```
         app                                             header                                         url
0        fix                     iFixit: The Free Repair Manual                     https://www.ifixit.com/
1      jd_id                                              Jd.id                          https://www.jd.id/
2     zalora  Zalora - Asia'S Leading Online Fashion Destina...                         https://zalora.com/
3  leomaster                               LEOMASTER | LinkedIn  https://www.linkedin.com/company/leomaster
```
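For 22,000+ names, firing requests in a tight loop will likely get the scraper rate-limited or blocked, and a crash midway loses everything. A minimal sketch of a batched driver around a lookup function like `extract_info` (passed in as `lookup`) follows; the batch size, delay, and checkpoint filename are illustrative choices, not requirements:

```python
import time

import pandas as pd

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def scrape_all(app_names, lookup, batch_size=100, delay=2.0,
               checkpoint="results.csv"):
    """Run `lookup` over all names in batches, sleeping between requests
    and checkpointing to CSV after each batch so progress survives a crash."""
    rows = []
    for batch in chunked(list(app_names), batch_size):
        for app in batch:
            try:
                info = lookup(app)
                if info is not None:
                    rows.append(info)
            except Exception as exc:
                print(f"skipping {app}: {exc}")  # log the failure, move on
            time.sleep(delay)  # be polite to the search endpoint
        if checkpoint:
            pd.DataFrame(rows).to_csv(checkpoint, index=False)
    return pd.DataFrame(rows)
```

You would call it as `scrape_all(app_list, extract_info)`; restarting after a failure only needs the names not yet present in the checkpoint file.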