python pandas dataframe automation google-api

Can we convert app names to the page header and URL returned by a search engine, using pandas?


Here's my input

app
fix
jd_id
zalora
leomaster

Here's my expected output

app         header                                                        url
fix         Fix.com | Your Source for Genuine Parts & DIY Repair Help     https://www.fix.com/             
jd_id       jdid                                                          https://www.jd.id/
zalora      ZALORA Indonesia: Belanja Online Fashion & Lifestyle Terbaru  https://www.zalora.co.id/   
leomaster   Leomaster — Manufacturers of fine fabrics                     https://www.leomaster.it/en/

This can be done manually in Google Chrome with a tedious copy-paste process, but I have 22,000+ apps that need to be checked, so I need a scalable solution.


Solution

  • To do this with Google you will need a Google Search API account, so my solution uses DuckDuckGo instead. The approach is the same with Google:

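    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    def extract_info(app_name):
        """Return the header and URL of the first https search result, or None."""
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        }
    
        # Pass the query through params so requests URL-encodes it,
        # and set a timeout so a stalled request cannot hang the run.
        response = requests.get(
            "https://duckduckgo.com/html/",
            params={"q": f"{app_name} website"},
            headers=headers,
            timeout=10,
        )
    
        soup = BeautifulSoup(response.content, "html.parser")
    
        # Each search hit is a <div class="result"> on the HTML results page.
        search_results = soup.find_all("div", class_="result")
    
        for result in search_results:
            link = result.find("a")
            if link is not None:
                header = link.get_text()
                # Default to "" so a missing href does not raise on startswith.
                result_url = link.get("href", "")
                if result_url.startswith("https://"):
                    return {"app": app_name, "header": header, "url": result_url}
    
        # No usable result was found for this app.
        return None
    
    app_list = ["fix", "jd_id", "zalora", "leomaster"]
    
    results = [extract_info(app) for app in app_list]
    
    # Drop the apps for which no result was found.
    results = [r for r in results if r is not None]
    
    df = pd.DataFrame(results)
    
    print(df)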
    

    which returns:

             app                                             header  \
    0        fix                     iFixit: The Free Repair Manual   
    1      jd_id                                              Jd.id   
    2     zalora  Zalora - Asia'S Leading Online Fashion Destina...   
    3  leomaster                               LEOMASTER | LinkedIn   
    
                                              url  
    0                     https://www.ifixit.com/  
    1                          https://www.jd.id/  
    2                         https://zalora.com/  
    3  https://www.linkedin.com/company/leomaster
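
Since the question mentions 22,000+ apps, here is a minimal sketch of how the per-app lookup might be wrapped for a larger run. The helper name `lookup_all` and its `delay` parameter are hypothetical, not part of the answer above. The sketch assumes the lookup function behaves like `extract_info`, returning a dict with `header` and `url` keys, or `None`. It looks up each distinct app only once, pauses between requests, and keeps a row for every input app even when nothing was found.

```python
import time

import pandas as pd


def lookup_all(apps, lookup, delay=1.0):
    """Call lookup(app) once per distinct app; return one row per input app.

    `lookup` is any callable that returns a dict with "header" and "url"
    keys, or None (an assumption modelled on extract_info).
    """
    cache = {}
    # dict.fromkeys removes duplicates while preserving input order.
    for app in dict.fromkeys(apps):
        try:
            cache[app] = lookup(app)
        except Exception:
            # A failed request should not abort the whole batch;
            # record "no result" for this app and move on.
            cache[app] = None
        time.sleep(delay)  # pause between lookups to limit the request rate

    rows = []
    for app in apps:
        hit = cache[app] or {}
        rows.append({"app": app, "header": hit.get("header"), "url": hit.get("url")})
    return pd.DataFrame(rows, columns=["app", "header", "url"])
```

With the function from the answer this would be called as `df = lookup_all(app_list, extract_info)`. At one second per lookup, 22,000 distinct apps need roughly six hours of waiting plus request time, so it may be worth running in chunks and saving each chunk with `df.to_csv(...)`.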