I am using pytrends to download search interest in newspapers by metro area. Here is an example for one metro area (Austin, TX):
from pytrends.request import TrendReq
import pandas as pd
code = 'US-TX-635'  # geo code for the Austin, TX metro area
papers = ['The Wall Street Journal', 'New York Post', 'The New York Times', 'Boston Herald', 'San Francisco Chronicle']
pytrend = TrendReq()
pytrend.build_payload(kw_list=papers, cat=408, timeframe='all', geo=code)
test = pytrend.interest_over_time()
I understand that there is some randomness in Google Trends (referenced in this post), but the differences I am getting are larger than that randomness would explain, and they persist even when I take many samples and average across them. For example, when I search for the five newspapers on the Google Trends site, the exact numbers vary, but the papers always rank in the same order of popularity: New York Times, Wall Street Journal, New York Post, San Francisco Chronicle, Boston Herald. None of the samples I get from pytrends shows this ordering. Further, as one would expect, search interest for most of the papers peaks during the financial crisis according to the data from the site, but the pytrends data shows no such peak.
For reference, here is the query I did on the site.
Does anyone know why this might be happening or if there is another API that might yield more accurate results?
I ran into a similar issue, so I know the answer to your question. The public-facing Google Trends site is showing you data for each newspaper as a knowledge graph entity (i.e., a topic) rather than as the literal query string. For example, on the Google Trends site the Wall Street Journal as a topic is represented by the Freebase ID /m/017b3j. Querying by topic includes relevant searches with typos and indirect descriptions. This should account for the differences you are seeing in the data.
When using pytrends, the keyword 'The Wall Street Journal' is treated as a literal search term rather than as a topic. If you replace it with '/m/017b3j', Google will treat the query as a topic, and your results should match those on the Trends website.
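Here is a minimal sketch of that swap for the whole list. It assumes pytrends' `suggestions()` method returns a list of dicts with `mid` and `title` keys, and that an exact title match picks the right topic; both are assumptions worth checking against your installed version. The helper names (`pick_topic_mid`, `to_topic_keywords`, `fetch_topic_interest`) are made up for illustration, and any paper without a matching topic falls back to its literal string.

```python
def pick_topic_mid(suggestions, title):
    """Return the 'mid' of the first suggestion whose title matches, else None."""
    for s in suggestions:
        if s.get("title", "").lower() == title.lower():
            return s.get("mid")
    return None


def to_topic_keywords(pytrend, names):
    """Map each name to a topic ID; keep the literal string if no topic is found."""
    keywords = {}
    for name in names:
        # Assumed response shape: [{'mid': ..., 'title': ..., 'type': ...}, ...]
        mid = pick_topic_mid(pytrend.suggestions(keyword=name), name)
        keywords[name] = mid or name
    return keywords


def fetch_topic_interest(names, geo, cat=408, timeframe="all"):
    """Query interest over time by topic ID instead of by literal search term."""
    from pytrends.request import TrendReq  # imported here; needs network access

    pytrend = TrendReq()
    keywords = to_topic_keywords(pytrend, names)
    pytrend.build_payload(kw_list=list(keywords.values()), cat=cat,
                          timeframe=timeframe, geo=geo)
    df = pytrend.interest_over_time()
    # Columns come back labelled by topic ID; map them back to readable names.
    return df.rename(columns={mid: name for name, mid in keywords.items()})
```

With the variables from your question, the call would be `test = fetch_topic_interest(papers, geo=code)`. It is worth printing the result of `to_topic_keywords` once to confirm that each paper resolved to the topic you expect.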
(Note that in your linked reference query, the WSJ is represented by %2Fm%2F017b3j, which is the URL-encoded version of /m/017b3j.)
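If you want to pull topic IDs out of a Trends URL yourself, the encoding can be reproduced with the standard library. This is only an illustration of the percent-encoding; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

topic_id = "/m/017b3j"
encoded = quote(topic_id, safe="")  # safe="" so that "/" is escaped as %2F
decoded = unquote(encoded)          # reverses the encoding

print(encoded)  # %2Fm%2F017b3j
print(decoded)  # /m/017b3j
```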
Hope this helps!