I'm building a scraper that needs to run fast over a large number of webpages. The code below produces a CSV file with a list of links (among other things). Basically, I build a list of webpages that each contain several links, and for each of these pages I collect those links.
Implementing multiprocessing leads to some weird results that I haven't been able to explain. If I run this code with the pool size set to 1 (hence, without multithreading), I get a final result with 0.5% duplicated links (which is fair enough). As soon as I speed things up by setting the pool size to 8, 12 or 24, I get around 25% duplicated links in the final results.
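For reference, this is a quick standalone way to measure that duplicate rate from the output file (a hypothetical counting script, not part of the scraper; the filename matches the script below):

import unicodecsv as csv
from collections import Counter

# Count the share of rows whose link already appeared earlier in the file
with open('keyword_results.csv', 'rb') as f:
    links = [row["link"] for row in csv.DictReader(f)]
counts = Counter(links)
dupes = sum(c - 1 for c in counts.values())
print "%.1f%% duplicated links" % (100.0 * dupes / len(links))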
I suspect my mistake is in the way I write the results to the CSV file, or in the way I use the imap() function (the same happens with imap_unordered(), map(), etc.), which somehow leads the threads to access the same elements of the iterable passed in. Any suggestions?
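Here is a quick standalone check (separate from the scraper) of whether imap can hand the same element of the iterable to more than one worker; since imap consumes the iterable exactly once in the parent process, every input should come back exactly once:

from collections import Counter
from multiprocessing.pool import ThreadPool

def echo(x):
    return x

if __name__ == '__main__':
    pool = ThreadPool(8)
    counts = Counter(pool.imap(echo, range(10000)))
    pool.close()
    pool.join()
    # True if no element was dispatched twice
    print all(v == 1 for v in counts.values())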
#!/usr/bin/env python
# coding: utf8
import sys
import requests, re, time
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
import random
import unicodecsv as csv
import progressbar
import multiprocessing
from multiprocessing.pool import ThreadPool
keyword = "keyword"
def openup():
    global crawl_list
    try:
        ### Generate the list of URLs based on the number of results for the
        ### keyword; each of these pages contains other links. The list is
        ### subsequently randomized.
        startpage = 1
        ## Get endpage
        url0 = myurl0
        r0 = requests.get(url0)
        print "First request: " + str(r0.status_code)
        tree = html.fromstring(r0.content)
        endpage = tree.xpath("//*[@id='habillagepub']/div[5]/div/div[1]/section/div/ul/li[@class='adroite']/a/text()")
        print str(endpage[0]) + " pages found"
        ### Generate a random sequence for crawling
        crawl_list = random.sample(range(1, int(endpage[0]) + 1), int(endpage[0]))
        return crawl_list
    except Exception as e:
        ### Catch openup errors and return an empty crawl list
        print e
        crawl_list = []
        return crawl_list
def worker_crawl(x):
    ### Open page x
    url_base = myurlbase
    r = requests.get(url_base)
    print "Connecting to page " + str(x) + " ..." + str(r.status_code)
    if r.status_code == 200:
        tree = html.fromstring(r.content)
        ### Get data
        titles = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/text()')
        links = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')
        abstracts = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/p/text()')
        footers = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/span/text()')
        dates = []
        pagenums = []
        for f in footers:
            pagenums.append(x)
            match = re.search(r'\| .+$', f)
            if match:
                date = match.group()
                dates.append(date)
        pageindex = zip(titles, links, abstracts, footers, dates, pagenums)  # what if there is a missing value?
        return pageindex
    else:
        ### On a non-200 response, return a placeholder row for this page
        pageindex = [[str(r.status_code), "", "", "", "", str(x)]]
        return pageindex
def mp_handler():
    ### Write the results out as they come in
    with open(keyword + '_results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=["title", "link", "abstract", "footer", "date", "pagenum"])
        wr.writeheader()
        results = p.imap(worker_crawl, crawl_list)
        for result in results:
            for x in result:
                wr.writerow({
                    # "keyword": str(keyword),
                    "title": x[0],
                    "link": x[1],
                    "abstract": x[2],
                    "footer": x[3],
                    "date": x[4],
                    "pagenum": x[5],
                })
if __name__ == '__main__':
    p = ThreadPool(4)
    openup()
    mp_handler()
    p.terminate()
    p.join()
Are you sure the page responds correctly under a fast sequence of requests? I have been in situations where the scraped site responded differently when the requests were fast than when they were spaced out in time. Meaning, everything went perfectly while debugging, but as soon as the requests were fast and in sequence, the website decided to give me a different response.

Besides this, I would ask whether the fact that you are writing in a non-thread-safe environment might have an impact. To minimize interactions on the final CSV output and issues with the data, you might:

- keep all CSV writing in a single thread (as your mp_handler already does) and never write from the workers;
- if workers must write, serialize access to the writer with a lock;
- or have each worker write to its own file and merge the files afterwards;
- and deduplicate rows (e.g. by link) before they reach the file.
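A minimal sketch of those ideas, assuming a placeholder URL, a placeholder XPath and a made-up page range (adapt them to your site): requests are retried with backoff instead of failing on the first non-200, all CSV writing stays in the main thread, and rows are deduplicated by link before they reach the file.

#!/usr/bin/env python
# coding: utf8
import time
import requests
import unicodecsv as csv
from lxml import html
from multiprocessing.pool import ThreadPool

FIELDS = ["title", "link", "abstract", "footer", "date", "pagenum"]

def worker_crawl(x):
    ### Retry with backoff instead of giving up on the first non-200
    for attempt in range(3):
        r = requests.get("http://example.com/page/%d" % x)  # placeholder URL
        if r.status_code == 200:
            tree = html.fromstring(r.content)
            links = tree.xpath("//a/@href")  # placeholder XPath
            return [dict(zip(FIELDS, ("", link, "", "", "", str(x)))) for link in links]
        time.sleep(2 ** attempt)  # space the retries out
    return []

if __name__ == '__main__':
    p = ThreadPool(8)
    seen = set()  # links already written; dedup happens here, in one thread
    with open('results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=FIELDS)
        wr.writeheader()
        ### imap keeps all CSV writing in the main thread, so no lock is needed
        for rows in p.imap(worker_crawl, range(1, 101)):
            for row in rows:
                if row["link"] in seen:
                    continue  # drop duplicates before they reach the file
                seen.add(row["link"])
                wr.writerow(row)
    p.close()
    p.join()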