
PhantomJS browser not loading javascript for certain urls


I am trying to download Google Trends data, using PhantomJS to load the page and extract the required data. When I run my code with only one keyword in the URL (example URL: https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue), it works fine. As soon as I add a second keyword (example URL: https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue,Red), PhantomJS no longer loads the page correctly and I am unable to find the data that I need. I have tried increasing the time the browser waits and have tried a number of different keywords, without any success. I am out of ideas and simply do not understand why my program stops working after such a slight change to the URL. The tags and page structure are nearly identical for both URLs, so the issue is not that the tags no longer have the same names as before. Here is the code in question:

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Reading google trends data
    google_trend_array = []
    url = 'https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue,Red'
    browser = webdriver.PhantomJS('...\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
    ran_smooth = False
    time_to_sleep = 3
    # ran_smooth makes sure that page has loaded and necessary code was extracted, if not it will try to load the page again
    while ran_smooth is False:
        browser.get(url)
        time.sleep(time_to_sleep)
        soup = BeautifulSoup(browser.page_source, "html.parser")  # BS object to use bs4
        table = soup.find('div', {'aria-label': 'A tabular representation of the data in the chart.'})
        # If the page didn't load, table is None and this try will throw an AttributeError
        try:
            # Copies all the data out of google trends table
            for col in table.findAll('td'):
                # the table has both dates and trend values, this check ensures that we only read the trend values
                if col.string.isdigit():
                    trend_number = int(col.string)
                    google_trend_array.append(trend_number)

            # program ran through, leave while loop
            ran_smooth = True
        except AttributeError:
            print('page not loading for ' + url + ', trying again...')
            time_to_sleep += 1  # increase time to sleep so that page can load
    print(google_trend_array)
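
One side note on the loop above: it retries forever if the table never shows up, which makes a failure like this one hard to tell apart from a slow page. A minimal sketch of a bounded retry is given below. The helper name `retry_with_growing_delay` and the `fetch` callback are hypothetical, introduced only for illustration; this does not explain why the two-keyword URL fails, it only makes the failure visible.

```python
def retry_with_growing_delay(fetch, max_attempts=5, first_delay=3, delay_step=1):
    """Call fetch(delay) with a growing delay until it returns something other than None.

    fetch (hypothetical callback) is expected to load the page, wait `delay`
    seconds, and return the parsed data, or None if the page was not ready.
    """
    delay = first_delay
    for _ in range(max_attempts):
        result = fetch(delay)
        if result is not None:
            return result
        delay += delay_step  # give the page more time on the next attempt
    # Stop instead of looping forever, so the caller can inspect the page source.
    raise RuntimeError('no data after %d attempts' % max_attempts)
```

In the code above, `fetch` would wrap `browser.get(url)`, `time.sleep(delay)` and the BeautifulSoup parsing, and return `None` whenever `soup.find(...)` comes back empty.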

Solution

  • You ought to look at pytrends, and not reinvent the wheel.

    Here is a small example of how to extract a data frame from Google Trends:

    import pytrends.request
    
    google_username = "<your_login>@gmail.com"
    google_password = "<your_password>"
    
    # connect to Google
    pytrend = pytrends.request.TrendReq(google_username, google_password, custom_useragent='My Pytrends Script')
    trend_payload = {'q': 'Pizza, Italian, Spaghetti, Breadsticks, Sausage', 'cat': '0-71'}
    # trend = pytrend.trend(trend_payload)
    
    df = pytrend.trend(trend_payload, return_type='dataframe')
    

    You'll get:

                breadsticks  italian  pizza  sausage  spaghetti
    Date                                                       
    2004-01-01          0.0      9.0   34.0      3.0        3.0
    2004-02-01          0.0     10.0   32.0      2.0        3.0
    2004-03-01          0.0     10.0   32.0      2.0        3.0
    2004-04-01          0.0      9.0   31.0      2.0        2.0
    2004-05-01          0.0      9.0   32.0      2.0        2.0
    2004-06-01          0.0      8.0   29.0      2.0        3.0
    2004-07-01          0.0      8.0   34.0      2.0        3.0
    [...]
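
Note that the login-based `TrendReq(username, password)` constructor and the `trend()` method used above belong to an early pytrends release. To the best of my knowledge, later releases dropped the login and replaced these calls with `build_payload()` and `interest_over_time()`; treat that as an assumption and check the README of the version you have installed. Under that assumption, a rough sketch for the two-keyword query from the question could look like this. The helpers `parse_keywords` and `fetch_interest` are hypothetical names, not part of pytrends.

```python
def parse_keywords(q):
    """Split a comma-separated 'q' string (as in the explore URL) into a keyword list."""
    keywords = [kw.strip() for kw in q.split(',') if kw.strip()]
    if len(keywords) > 5:
        # Google Trends compares at most five terms in one request.
        raise ValueError('at most 5 keywords can be compared at once')
    return keywords


def fetch_interest(q, timeframe='today 3-m', geo='US'):
    """Return interest over time as a pandas DataFrame (needs network access)."""
    # Imported here so parse_keywords() stays usable without pytrends installed.
    from pytrends.request import TrendReq

    # Assumed API of later pytrends releases: no Google login required.
    pytrend = TrendReq(hl='en-US', tz=360)
    pytrend.build_payload(parse_keywords(q), timeframe=timeframe, geo=geo)
    return pytrend.interest_over_time()
```

Calling `fetch_interest('Blue,Red')` should then return one column per keyword, indexed by date, without any browser automation.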