I have a list of more than 250 websites and would like to get all the text from each one for further analysis. The problem is that some websites take a long time to load, or the script even gets stuck while sending the request.
Here's the code:
import time

import requests
import timeout_decorator
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_the_text(_df):
    '''
    Send a request to receive the text of the articles.

    Parameters
    ----------
    _df : DataFrame

    Returns
    -------
    DataFrame with the text of the articles
    '''
    _df['text'] = ''
    for k, link in enumerate(_df['url']):
        if link:
            website_text = list()
            print(link, '\n', 'K:', k)
            # time.sleep(2)
            session = requests.Session()
            retry = Retry(connect=2, backoff_factor=0.3)
            adapter = HTTPAdapter(max_retries=retry)
            session.mount('http://', adapter)
            session.mount('https://', adapter)
            # signal.signal(signal.SIGALRM, took_too_long)
            # signal.setitimer(signal.ITIMER_REAL, 10)  # 10 seconds
            try:
                timeout_decorator.timeout(seconds=10)  # timeout of 10 seconds
                time.sleep(1)
                response = session.get(link)
                # signal.setitimer(signal.ITIMER_REAL, 0)  # success, reset to 0 to disable the timer
                # GETS THE TEXT IN THE WEBSITE THEN
            except TimeoutError:
                print('Took too long')
                continue
            except ConnectionError:
                print('Connection error')
As you can see, I tried both solutions mentioned in this post. I found out that the signal-based solution does not work for me, because SIGALRM is not supported on Windows. The second solution, timeout_decorator, doesn't throw an exception when the request takes longer than, for example, 10 seconds.
I would like to skip a request when it takes more than 10 seconds to process. How can I achieve this?
I found the func-timeout library, which raises an exception after a given number of seconds. It works on Windows as well as on other operating systems.
You pass func_timeout the timeout, the function you want to call, and any arguments. It runs the function for up to timeout seconds and returns or raises anything the passed function would otherwise return or raise.
It should be used like this:
import func_timeout

for k, link in enumerate(df['url']):
    if link:
        try:
            response = func_timeout.func_timeout(timeout=10, func=send_request, args=[link])
        except func_timeout.FunctionTimedOut:
            print('Took too long to respond to the request')
            continue
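
If installing an extra package is not an option, a similar "wait at most N seconds, then skip" behaviour can be sketched with the standard library alone. This is only a rough stand-in and not how func_timeout is implemented: call_with_timeout, CallTimedOut, fetch_all and fake_request are hypothetical names made up for this example, and fake_request merely simulates a slow site so the snippet runs without network access. Note that the sketch never stops the worker thread; it simply abandons it and moves on to the next link.

```python
import threading
import time


class CallTimedOut(Exception):
    """Raised by call_with_timeout when the call exceeds its time budget."""


def call_with_timeout(timeout, func, args=()):
    """Run func(*args) in a daemon thread and wait at most `timeout` seconds."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:
            # Hand the exception back so the caller sees what func raised.
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The worker is abandoned, not killed; it keeps running in the background.
        raise CallTimedOut(f"no result after {timeout} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def fake_request(link):
    """Hypothetical stand-in for a real request; 'slow' simulates a hanging site."""
    if link == "slow":
        time.sleep(1)
    return "text of " + link


def fetch_all(links, fetch, timeout=10):
    """Map each non-empty link to fetch(link), skipping links that time out."""
    results = {}
    for link in links:
        if not link:
            continue
        try:
            results[link] = call_with_timeout(timeout, fetch, args=[link])
        except CallTimedOut:
            print("Took too long to respond to the request:", link)
            continue
    return results
```

In the loop above, fetch would be your own send_request function, for example fetch_all(df['url'], send_request, timeout=10).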