I use a multiprocessing Pool to speed up my scraping and everything works, except that I don't understand why Python writes the CSV header again every 30 rows. I know it is linked to the pool size parameter I passed, but how can I correct this behaviour?
import csv
import requests
from multiprocessing import Pool
from pprint import pprint

# colonnes, headers, lang and liens are defined earlier in my script

def parse(url):
    dico = {i: '' for i in colonnes}
    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    # sleep(2)
    if r.status_code == 200:
        # I scrape my data here
        ...
        pprint(dico)
        writer.writerow(dico)
        return dico

with open(lang + '/petitions_' + lang + '.csv', 'a') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=colonnes)
    writer.writeheader()
    with Pool(30) as p:
        p.map(parse, liens)
Can someone tell me where to put the 'writer.writerow(dico)' call to avoid the repeated headers? Thanks.
Check whether the file already exists:
os.path.isfile('mydirectory/myfile.csv')
If it exists, don't write the header again. Create one function (def ...) that writes the header and another that writes the data rows, as in the sketch below.
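A minimal sketch of that idea, assuming the colonnes, lang, liens and parse() from the question are already defined and that parse() only returns dico instead of writing it (write_header and write_rows are just illustrative helper names):

import csv
import os
from multiprocessing import Pool

def write_header(path, fieldnames):
    # Only write the header when the file does not exist yet
    if not os.path.isfile(path):
        with open(path, 'w', newline='') as csvfile:
            csv.DictWriter(csvfile, fieldnames=fieldnames).writeheader()

def write_rows(path, fieldnames, rows):
    # Append the data rows once, in the parent process, after the pool is done
    with open(path, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerows(rows)

if __name__ == '__main__':
    path = lang + '/petitions_' + lang + '.csv'    # same file name as in the question
    write_header(path, colonnes)
    with Pool(30) as p:
        results = p.map(parse, liens)
    # Skip failed requests (parse() returns None when status_code != 200)
    write_rows(path, colonnes, [dico for dico in results if dico])

This also keeps all file writes in the main process, so the worker processes never touch the CSV file themselves.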