Tags: python, csv, input, beautifulsoup, consistency

Keeping input and output consistent while scraping URLs from a CSV file in Python


I need your help with this question:

I have a working Python script here:

from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]

        # fetch content from server
        html = requests.get(url).content

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            tags = divTag.find_all("a")
        else:
            continue

        for tag in tags:
            res = tag.get('href')
            if res is not None:
                writer.writerow([res])

Source: https://stackoverflow.com/a/50328564/6653461

Basically, what I need to change is how to keep the input and output consistent, line by line. See the sample below:


The idea behind all this is to get/print the redirected link: if the link works, print the link; if not, print an error marker or similar.
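
If a "working link" here simply means a URL that responds successfully over HTTP, the check can be made from the response status. A minimal sketch (the helper name link_status is illustrative, not from the script above):

    import requests

    def link_status(url):
        # 'ok' for a 2xx/3xx response, the HTTP code for 4xx/5xx,
        # and 'error' when the request itself fails (DNS, refused, timeout, ...)
        try:
            response = requests.get(url, timeout=10)
            return 'ok' if response.ok else 'HTTP {}'.format(response.status_code)
        except requests.exceptions.RequestException:
            return 'error'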

urls.csv sample

https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193; - valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193; - non valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589; - valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093; - non valid
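
If urls.csv really is semicolon-delimited, as the sample suggests, then reading it with delimiter=';' and keeping only the first field means the trailing "- valid" / "- non valid" annotations never reach the scraper. A minimal sketch, assuming that layout:

    import csv

    with open('urls.csv', 'r') as csv_file:
        for row in csv.reader(csv_file, delimiter=';'):
            # row[0] is the URL; everything after the first ';' is ignored
            if row and row[0].strip():
                print(row[0].strip())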

Solution

  • You just need to add some more items to the list you are writing with the writer.writerow() function:

    from bs4 import BeautifulSoup
    import requests
    import csv
    
    with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
        reader = csv.reader(csvFile)
        writer = csv.writer(results)
    
        for row in reader:
            # get the url
    
            for url in row:
                url = url.strip()
    
                # Skip any empty URLs
                if len(url):
                    print(url)
    
                    # fetch content from server
    
                    try:
                        html = requests.get(url).content
                    except requests.exceptions.ConnectionError as e:
                        writer.writerow([url, '', 'bad url'])
                        continue
                    except requests.exceptions.MissingSchema as e:
                        writer.writerow([url, '', 'missing http...'])
                        continue
    
                    # soup fetched content
                    soup = BeautifulSoup(html, 'html.parser')
    
                    divTag = soup.find("div", {"class": "productsPicture"})
    
                    if divTag:
                        # Return all 'a' tags that contain an href
                        for a in divTag.find_all("a", href=True):
                            url_sub = a['href']
    
                            # Test that link is valid
                            try:
                                r = requests.get(url_sub)
                                writer.writerow([url, url_sub, 'ok'])
                            except requests.exceptions.ConnectionError as e:
                                writer.writerow([url, url_sub, 'bad link'])
                    else:
                        writer.writerow([url, '', 'no results'])
    

    Giving you:

    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193,https://www.tennis-point.com/asics-gel-game-6-all-court-shoe-men-white-silver-02013802643000.html,ok
    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193,,no results
    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589,https://www.tennis-point.com/asics-gel-resolution-7-clay-court-shoe-men-blue-lime-02014202831000.html,ok
    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093,,no results
    

    Exception handling can catch the case where the URL from the CSV file is invalid. You can also test that the URL returned from the link on the page is valid. The third column could then give you a status, i.e. ok, bad url, no results or bad link.
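
    If you would rather guarantee exactly one output row per input URL (instead of one row per link found inside the div), the fetch-and-check logic can be pulled into a helper that always returns a single (link, status) pair. This is a sketch of one possible refactor, not part of the answer above; it reports only the first matching href and mirrors the same exception handling:

        from bs4 import BeautifulSoup
        import requests

        def first_product_link(url):
            # Returns (link, status) so the caller can always write
            # exactly one row: writer.writerow([url, link, status]).
            try:
                html = requests.get(url).content
            except requests.exceptions.MissingSchema:
                return '', 'missing http...'
            except requests.exceptions.ConnectionError:
                return '', 'bad url'

            soup = BeautifulSoup(html, 'html.parser')
            div_tag = soup.find('div', {'class': 'productsPicture'})
            a_tag = div_tag.find('a', href=True) if div_tag else None
            if not a_tag:
                return '', 'no results'

            # Check that the scraped link itself responds
            try:
                requests.get(a_tag['href'])
            except requests.exceptions.ConnectionError:
                return a_tag['href'], 'bad link'
            return a_tag['href'], 'ok'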

    It assumes that all columns in your CSV file contain URLs that need to be tested.