I want to scrape Google's "People also ask" questions and answers. I am doing it successfully with the following module:
pip install people_also_ask
The problem is that the library is configured so that no one can send many requests to Google. I want to send 1,000 requests per day, and to achieve that I have to add fake_useragent to the module. I tried a lot, but whenever I try to add a fake user agent to the header it gives an error. I am not a pro, so I must have done something wrong myself. Can anyone help me add fake_useragent to the module (people_also_ask)? Here is the working code to get the questions and answers:
import people_also_ask as paa
from fake_useragent import UserAgent

ua = UserAgent()

# Keep prompting until query.txt can be opened and read.
while True:
    input("Please make sure the queries are in the query.txt file.\nPress Enter to continue...")
    try:
        with open("query.txt", "r") as query_file:
            queries = query_file.readlines()
        break
    except OSError:
        print("Error with the query.txt file...")

for query in queries:
    # Each query appends one row to result.csv.
    with open("result.csv", "a", encoding="utf_8") as res_file:
        query = query.strip()
        print(f'Searching for "{query}"')
        questions = paa.get_related_questions(query, 14)
        questions.insert(0, query)
        print("\n________________________\n")
        main_q = True
        for i in questions:
            i = i.split('?')[0]
            answer = ""  # fallback so a failed lookup does not reuse the previous answer
            try:
                answer = str(paa.get_answer(i)['response'])
                # If the answer ends in a digit, trim its last 11 characters.
                if answer[-1].isdigit():
                    answer = answer[:-11]
                print(f"Question:{i}?")
            except Exception as e:
                print(e)
            print(f"Answer:{answer}")
            # The query itself is written plain; related questions get <h2> tags.
            if main_q:
                a = ""
                b = ""
                main_q = False
            else:
                a = "<h2>"
                b = "</h2>"
            res_file.write(f'{a}{i}?{b},"<p>{answer}</p>",')
            print("______________________")
        print("______________________")
        res_file.write("\n")

print("\nSearch Complete.")
input("Press Enter to exit!")
This is against Google's terms of service and the wishes of the people_also_ask package's authors. This answer is for educational purposes only.
You asked why fake_useragent is prevented from working. It isn't prevented from working; the people_also_ask package simply doesn't make any calls to fake_useragent methods. You can't just import a package and expect another package to start using it. You have to make the packages work together manually.
To do that, you need some idea of how the two packages work. Have a look at the source code and you will see that you can make them work together very easily: just replace the constant header in people_also_ask with one generated by fake_useragent before you request any data.
paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
questions = paa.get_related_questions(query, 14)
and
paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
answer = str(paa.get_answer(i)['response'])
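The reassignment takes effect because a function looks up a module-level name each time it runs, so rebinding that name changes what later calls see. Here is a minimal sketch of that mechanism, assuming people_also_ask reads its header constant at call time. The `fake_google` module, its `get` function, and the fixed agent pool are hypothetical stand-ins, not part of either package, so nothing touches the network:

```python
import itertools
import types

# Hypothetical stand-in for a library module that keeps its request
# headers in a module-level constant (NOT the real people_also_ask code).
LIB_SOURCE = """
HEADERS = {'User-Agent': 'default-agent'}

def get(url):
    # HEADERS is looked up when the function runs, not when it is defined.
    return {'url': url, 'headers': HEADERS}
"""
fake_google = types.ModuleType("fake_google")
exec(LIB_SOURCE, fake_google.__dict__)

# A fixed pool stands in for fake_useragent's ua.random here.
agents = itertools.cycle(["agent-A", "agent-B"])


def get_with_fresh_agent(url):
    """Rebind the module constant, then call the library function."""
    fake_google.HEADERS = {"User-Agent": next(agents)}
    return fake_google.get(url)


before = fake_google.get("https://example.com/0")
first = get_with_fresh_agent("https://example.com/1")
second = get_with_fresh_agent("https://example.com/2")
print(before["headers"], first["headers"], second["headers"])
```

Wrapping the assignment and the call in one helper, as in the hypothetical `get_with_fresh_agent`, means the header cannot be forgotten before a request.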
NOTE:
Not all user agents will work. Depending on the user agent, Google may not return any related questions. This is not the fault of either the fake_useragent package or the people_also_ask package.
Example that uses only the most common browsers (although all the user agent strings should now be up to date anyway):
from fake_useragent import UserAgent
ua = UserAgent(min_percentage=1.3)
ua.random
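Since some user agents return no related questions, one option is to try another agent when a result comes back empty. This is a rough sketch, assuming a hypothetical `fetch(query, user_agent)` callable that you would write yourself around the header assignment and the people_also_ask call. The names `first_non_empty`, `max_tries`, and `demo_fetch` are illustrative only and are not part of either package:

```python
import itertools


def first_non_empty(fetch, query, user_agents, max_tries=3):
    """Call fetch(query, user_agent) with successive user agents until it
    returns a non-empty result or max_tries attempts have been made."""
    for user_agent in itertools.islice(user_agents, max_tries):
        result = fetch(query, user_agent)
        if result:
            return result, user_agent
    return [], None


# Stand-in fetch for demonstration: only "good-agent" yields questions.
def demo_fetch(query, user_agent):
    return [f"What is {query}?"] if user_agent == "good-agent" else []


result, used = first_non_empty(demo_fetch, "python", ["bad-agent", "good-agent"])
print(result, used)
```

With the real packages, `fetch` would set `paa.google.HEADERS` and then call `paa.get_related_questions`, and `user_agents` could be a generator that yields `ua.random` values. Keeping `max_tries` small avoids sending many extra requests for a query that simply has no related questions.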