Tags: python, csv, multiprocessing, screen-scraping, pool

Python multiprocessing scraping, writing data to a CSV file


I use a multiprocessing pool to speed up my scraping and everything works, except that Python rewrites the header of my CSV every 30 rows. I know this is linked to the pool size I passed in, but how can I correct this behavior?

import csv
from multiprocessing import Pool
from pprint import pprint

import requests

def parse(url):

    dico = {i: '' for i in colonnes}

    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    # sleep(2)

    if r.status_code == 200:
        # I scrape my data here
        ...
        pprint(dico)
        writer.writerow(dico)
    return dico

with open(lang + '/petitions_' + lang + '.csv', 'a') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=colonnes)
    writer.writeheader()
    with Pool(30) as p:
        p.map(parse, liens)

Can someone tell me where to put the `writer.writerow(dico)` call to avoid the repeated headers? Thanks


Solution

  • Check whether the file already exists before writing the header:

    os.path.isfile('mydirectory/myfile.csv')
    

    If it exists, don't write the header again. Split the work into one function (def ...) that writes the header and another that writes the data rows.
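Following that idea, here is a minimal sketch; the column names, path, and sample row are hypothetical stand-ins for the question's globals. One function writes the header only when the file does not exist yet, and another appends the data rows. To stop the workers from duplicating the header, `parse` only returns its dict and all file writing happens in the parent process after `p.map`:

```python
import csv
import os
from multiprocessing import Pool

# Hypothetical stand-ins for the question's globals
colonnes = ['title', 'signatures']
headers = {'User-Agent': 'Mozilla/5.0'}
csv_path = 'petitions_en.csv'

def write_header(path):
    # First job: write the header, and only if the file is not already there
    if not os.path.isfile(path):
        with open(path, 'w', newline='') as csvfile:
            csv.DictWriter(csvfile, fieldnames=colonnes).writeheader()

def parse(url):
    # Workers only scrape and *return* the row; they never touch the file
    import requests  # third-party; imported lazily so the sketch loads without it
    dico = {i: '' for i in colonnes}
    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    if r.status_code == 200:
        ...  # scrape into dico here
    return dico

def write_rows(path, rows):
    # Second job: append the data rows, all from the parent process
    with open(path, 'a', newline='') as csvfile:
        csv.DictWriter(csvfile, fieldnames=colonnes).writerows(rows)

if __name__ == '__main__':
    write_header(csv_path)
    # Real run would be: with Pool(30) as p: rows = p.map(parse, liens)
    rows = [{'title': 'Example petition', 'signatures': '123'}]
    write_rows(csv_path, rows)
```

Because the pooled workers never hold the `csv.DictWriter`, the header is written exactly once no matter how many processes the pool spawns.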