I am trying to download Google Trends data, using PhantomJS to load the page and extract the required data. When I run my code with only one keyword in the URL (example: https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue), it works fine. As soon as I add a second keyword (example: https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue,Red), PhantomJS no longer loads the page correctly and I cannot find the data I need. I have tried increasing the time the browser waits and have tried a number of different keywords, without success. I am out of ideas and do not understand why my program stops working after such a small change to the URL. The tags and page structure are nearly identical for both URLs, so the issue is not that the tags have different names. Here is the code in question:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Reading google trends data
google_trend_array = []
url = 'https://www.google.com/trends/explore?date=today%203-m&geo=US&q=Blue,Red'
browser = webdriver.PhantomJS('...\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
ran_smooth = False
time_to_sleep = 3
# ran_smooth makes sure that page has loaded and necessary code was extracted, if not it will try to load the page again
while ran_smooth is False:
    browser.get(url)
    time.sleep(time_to_sleep)
    soup = BeautifulSoup(browser.page_source, "html.parser")  # BS object to use bs4
    table = soup.find('div', {'aria-label': 'A tabular representation of the data in the chart.'})
    # If page didn't load, this try will throw an exception
    try:
        # Copies all the data out of google trends table
        for col in table.findAll('td'):
            # google has both dates and trend values, the following check ensures that we only read the trend values
            if col.string.isdigit() is True:
                trend_number = int(col.string)
                google_trend_array.append(trend_number)
        # program ran through, leave while loop
        ran_smooth = True
    except AttributeError:
        print('page not loading for term ' + str(term_to_trend) + ', trying again...')
        time_to_sleep += 1  # increase time to sleep so that page can load
print(google_trend_array)
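One detail in the extraction loop above may be worth ruling out. BeautifulSoup's `.string` is `None` when a tag has more than one child, so `col.string.isdigit()` raises `AttributeError` for such a cell, and the `except` branch then reports it as the page not loading. Whether this is what happens with two keywords is only an assumption, not a confirmed diagnosis. The sketch below pulls the digit filter into a small helper that skips `None` cells; the name `extract_trend_values` and the sample cell texts are hypothetical.

```python
def extract_trend_values(cell_strings):
    """Return the integer trend values found in an iterable of cell strings.

    Cells that are None, or that are not made up purely of digits
    (dates, for example), are skipped instead of raising an exception.
    """
    values = []
    for text in cell_strings:
        if text is None:
            # bs4 gives None from .string when a tag has several children
            continue
        text = text.strip()
        if text.isdigit():
            values.append(int(text))
    return values


if __name__ == "__main__":
    # Made-up cell texts standing in for the <td> contents of the table
    sample_cells = ["Oct 1, 2016", "42", None, " 17 ", "Oct 2, 2016", "100"]
    print(extract_trend_values(sample_cells))
```

With the loop above it could be called as `extract_trend_values(col.string for col in table.findAll('td'))`, after first checking that `table` is not `None`, so that a missing table and an unexpected cell are reported separately.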
You should look at pytrends rather than reinvent the wheel.
Here is a small example of how to extract a data frame from Google Trends:
import pytrends.request
google_username = "<your_login>@gmail.com"
google_password = "<your_password>"
# connect to Google
pytrend = pytrends.request.TrendReq(google_username, google_password, custom_useragent='My Pytrends Script')
trend_payload = {'q': 'Pizza, Italian, Spaghetti, Breadsticks, Sausage', 'cat': '0-71'}
# trend = pytrend.trend(trend_payload)
df = pytrend.trend(trend_payload, return_type='dataframe')
You'll get:
            breadsticks  italian  pizza  sausage  spaghetti
Date
2004-01-01          0.0      9.0   34.0      3.0        3.0
2004-02-01          0.0     10.0   32.0      2.0        3.0
2004-03-01          0.0     10.0   32.0      2.0        3.0
2004-04-01          0.0      9.0   31.0      2.0        2.0
2004-05-01          0.0      9.0   32.0      2.0        2.0
2004-06-01          0.0      8.0   29.0      2.0        3.0
2004-07-01          0.0      8.0   34.0      2.0        3.0
[...]
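To get from that data frame back to a flat list of integers like the `google_trend_array` built in the question, one column can be converted with standard pandas calls. This is a minimal sketch, assuming the frame has the shape shown above. The sample frame is built by hand from the first three rows of that output so the snippet runs without a Google login, and the helper name `column_to_int_list` is hypothetical.

```python
import pandas as pd

# Hand-built stand-in for the frame shown above (first three rows only)
df = pd.DataFrame(
    {
        "breadsticks": [0.0, 0.0, 0.0],
        "italian": [9.0, 10.0, 10.0],
        "pizza": [34.0, 32.0, 32.0],
        "sausage": [3.0, 2.0, 2.0],
        "spaghetti": [3.0, 3.0, 3.0],
    },
    index=pd.to_datetime(["2004-01-01", "2004-02-01", "2004-03-01"]),
)
df.index.name = "Date"


def column_to_int_list(frame, keyword):
    """Return one keyword's trend values as a plain list of Python ints."""
    return frame[keyword].astype(int).tolist()


if __name__ == "__main__":
    print(column_to_int_list(df, "pizza"))
```

Each keyword in the query becomes its own column, so several terms can be read from the same frame by calling the helper once per column name.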