Tags: python, csv, input, beautifulsoup, consistency

Keeping input and output consistent while scraping URLs from a CSV file in Python


I need your help with this question:

I have a working Python script here:

from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]

        # fetch content from server
        html = requests.get(url).content

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            tags = divTag.find_all("a")
        else:
            continue

        for tag in tags:
            res = tag.get('href')
            if res is not None:
                writer.writerow([res])

Source: https://stackoverflow.com/a/50328564/6653461

Basically, what I need to change is how to keep the input and output consistent, line by line. See the sample below:


The idea behind all this is to get/print the redirected link: if the link works, print the link; if not, print an error marker or similar.
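
If a "working link" here simply means a URL that responds successfully over HTTP, the check can be made from the response status. A minimal sketch (the helper name link_status is illustrative, not from the script above):

    import requests

    def link_status(url):
        # 'ok' for a 2xx/3xx response, the HTTP code for 4xx/5xx,
        # and 'error' when the request itself fails (DNS, refused, timeout, ...)
        try:
            response = requests.get(url, timeout=10)
            return 'ok' if response.ok else 'HTTP {}'.format(response.status_code)
        except requests.exceptions.RequestException:
            return 'error'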

urls.csv sample

https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193; - valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193; - non valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589; - valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093; - non valid
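
If urls.csv really is semicolon-delimited, as the sample suggests, then reading it with delimiter=';' and keeping only the first field means the trailing "- valid" / "- non valid" annotations never reach the scraper. A minimal sketch, assuming that layout:

    import csv

    with open('urls.csv', 'r') as csv_file:
        for row in csv.reader(csv_file, delimiter=';'):
            # row[0] is the URL; everything after the first ';' is ignored
            if row and row[0].strip():
                print(row[0].strip())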

Solution

  • You just need to add some more items to the list you are writing with the writer.writerow() function:

    from bs4 import BeautifulSoup
    import requests
    import csv
    
    with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
        reader = csv.reader(csvFile)
        writer = csv.writer(results)
    
        for row in reader:
            # get the url
    
            for url in row:
                url = url.strip()
    
                # Skip any empty URLs
                if len(url):
                    print(url)
    
                    # fetch content from server
    
                    try:
                        html = requests.get(url).content
                    except requests.exceptions.ConnectionError as e:
                        writer.writerow([url, '', 'bad url'])
                        continue
                    except requests.exceptions.MissingSchema as e:
                        writer.writerow([url, '', 'missing http...'])
                        continue
    
                    # soup fetched content
                    soup = BeautifulSoup(html, 'html.parser')
    
                    divTag = soup.find("div", {"class": "productsPicture"})
    
                    if divTag:
                        # Return all 'a' tags that contain an href
                        for a in divTag.find_all("a", href=True):
                            url_sub = a['href']
    
                            # Test that link is valid
                            try:
                                r = requests.get(url_sub)
                                writer.writerow([url, url_sub, 'ok'])
                            except requests.exceptions.ConnectionError as e:
                                writer.writerow([url, url_sub, 'bad link'])
                    else:
                        writer.writerow([url, '', 'no results'])
    

    Giving you:

    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193,https://www.tennis-point.com/asics-gel-game-6-all-court-shoe-men-white-silver-02013802643000.html,ok
    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193,,no results
    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589,https://www.tennis-point.com/asics-gel-resolution-7-clay-court-shoe-men-blue-lime-02014202831000.html,ok
    https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093,,no results
    

    Exception handling can catch the case where the URL from the CSV file is invalid. You can also test that the URL returned from the link on the page is valid. The third column could then give you a status, i.e. ok, bad url, no results or bad link.
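
    If you would rather guarantee exactly one output row per input URL (instead of one row per link found inside the div), the fetch-and-check logic can be pulled into a helper that always returns a single (link, status) pair. This is a sketch of one possible refactor, not part of the answer above; it reports only the first matching href and mirrors the same exception handling:

        from bs4 import BeautifulSoup
        import requests

        def first_product_link(url):
            # Returns (link, status) so the caller can always write
            # exactly one row: writer.writerow([url, link, status]).
            try:
                html = requests.get(url).content
            except requests.exceptions.MissingSchema:
                return '', 'missing http...'
            except requests.exceptions.ConnectionError:
                return '', 'bad url'

            soup = BeautifulSoup(html, 'html.parser')
            div_tag = soup.find('div', {'class': 'productsPicture'})
            a_tag = div_tag.find('a', href=True) if div_tag else None
            if not a_tag:
                return '', 'no results'

            # Check that the scraped link itself responds
            try:
                requests.get(a_tag['href'])
            except requests.exceptions.ConnectionError:
                return a_tag['href'], 'bad link'
            return a_tag['href'], 'ok'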

    It assumes that all columns in your CSV file contain URLs that need to be tested.