Search code examples
pythonpandasfor-loopif-statementpygsheets

How to create a new column from function result


Currently running the below script which checks for errors within a long list of urls. This code first looks for unique urls within df['Final_URL'], tests each separate url and returns the status of that link url. When I run the below code I get the current output on my notebook which is fine. Now I would like to push the status code (e.g 200, 404, BAD etc) to a new column in my df called "Status" for all url which equal to the unique urls I got at the beginning of the code.

What's the best way to create the new column df['Status'] and since I want to export this to google sheets do you know if text color is preserved when updating cells using pygsheets?

Input code:
#get unique urls and check for errors
URLS = []

for unique_link in df['Final_URL'].unique():
    URLS.append(unique_link)

try:

    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    ENDC = '\033[0m'

    def main():
        while True:
            print ("\nTesting URLs.", time.ctime())
            checkUrls()
            time.sleep(10) #Sleep 10 seconds
            break

    def checkUrls():     
        for url in URLS:
            status = "N/A"
            try:
                #check if regex contains bet3.com
                if re.search(".*bet3\.com.*", url):
                    status = checkUrl(url)
                else:
                    status = "BAD"

            except requests.exceptions.ConnectionError:
                status = "DOWN"

            printStatus(url, status)

            #for x in df['Final_URL']:
            #    if x == url:
            #        df['Status'] = printStatus(status)



    def checkUrl(url):
        r = requests.get(url, timeout=5)
        #print r.status_code
        return str(r.status_code)

    def printStatus(url, status):
        color = GREEN

        if status != "200":
            color=RED

        print (color+status+ENDC+' '+ url)



    #
    # Main app
    #
    if __name__ == '__main__':
        main()

except:

    print('Something went wrong!')



Current output:

200 https://www.bet3.com/dl/~offer
404 http://extra.bet3.com/promotions/en/soccer/soccer-accumulator-bonus
BAD https://extra.betting3.com/features/en/bet-builder
200 https://www.bet3.com/dl/6

Solution

  • You can just rewrite your function as such

    def checkUrl(url):
        if re.search(".*bet3\.com.*", url):
            try:
                r = requests.get(url, timeout=5)
            except requests.exceptions.ConnectionError:
                return 'DOWN'
            return str(r.status_code)
        return 'BAD'
    

    and then apply it like this

    df['Status'] = df['Final_URL'].apply(checkUrl)
    

    although, as noticed by user32185, in case there's duplicate URLs, this will call them twice.

    To avoid that you can do as user32185 suggested and write your function like so:

    def checkUrls(urls):
        results = []
        for url in urls:
            if re.search(".*bet3\.com.*", url):
                try:
                    r = requests.get(url, timeout=5)
                except requests.exceptions.ConnectionError:
                    results.append([url, 'DOWN'])
                results.append([url, str(r.status_code)])
            else:
                results.append([url, 'BAD'])
        return pd.DataFrame(data=results, columns=['Final_URL', 'Status'])
    

    and then use it like this:

    status_df = checkUrls(df['Final_URL'].unique())
    df = df.merge(status_df, how='left', on='Final_URL')