
Implement multithreading (or multiprocessing?) with this script?


Let me start by saying I don't have any real experience with multithreading. The script I wrote reads ~4,400 addresses from a text file, then cleans and geocodes each one. My brother mentioned that multithreading might speed it up. I read online that multithreading doesn't make much of a difference if you're only reading from a single text file. Would it help if I split the file into two text files? Anyway, I'd really appreciate it if someone could show me how to add multithreading or multiprocessing to this script to make it faster. If that's not possible, could you explain why? Thanks!

import tkinter as tk
from tkinter import filedialog

from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Bing

# Bing Maps geocoder; the argument is the API key.
geolocator = Bing('vadrPcGdNLSX5bPNL7tw~ySbwhthllg7rNA4VSJ-O4g~Ag28cbu9Slxp5Sh_AsBDuQ9WypPuEhl9pHVPCAkiPf4A9FgCBf3l0KyQTKKsLCHw')

root = tk.Tk()
root.withdraw()


def cleanAddress(dirty):
    try:
        clean = geolocator.geocode(dirty)
        x = clean.address
        address, city, zipcode, country = x.split(",")
        address = address.lower()
        if 'first' in address:
            address = address.replace('first', '1st')
        elif 'second' in address:
            address = address.replace('second', '2nd')
        elif 'third' in address:
            address = address.replace('third', '3rd')
        elif 'fourth' in address:
            address = address.replace('fourth', '4th')
        elif 'fifth' in address:
            address = address.replace('fifth', '5th')
        elif 'sixth' in address:
            # Remove 'avenue' before 'ave'; the other order leaves a stray 'nue'.
            address = address.replace('avenue', '')
            address = address.replace('ave', '')
            address = address.replace('sixth', 'avenue of the americas')
        elif '6th' in address:
            address = address.replace('avenue', '')
            address = address.replace('ave', '')
            address = address.replace('6th', 'avenue of the americas')
        elif 'seventh' in address:
            address = address.replace('seventh', '7th')
        elif 'fashion' in address:
            address = address.replace('fashion', '7th')
        elif 'eighth' in address:
            address = address.replace('eighth', '8th')
        elif 'ninth' in address:
            address = address.replace('ninth', '9th')
        elif 'tenth' in address:
            address = address.replace('tenth', '10th')
        elif 'eleventh' in address:
            address = address.replace('eleventh', '11th')
        zipcode = zipcode[3:]
        print(address + ",", zipcode.lstrip() + ",", str(clean.latitude) + ",", str(clean.longitude))
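    except (AttributeError, ValueError, GeocoderTimedOut):
        # AttributeError: geocode() returned None (no match for the address).
        # ValueError: the result did not split into exactly four parts.
        # GeocoderTimedOut: the service did not respond in time.
        print('Can not be cleaned')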


def main():
    root.update()
    fpath = filedialog.askopenfilename()
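    # The with-block closes the file even if an error occurs mid-loop.
    with open(fpath) as f:
        for line in f:
            # Strip the trailing newline before appending the city hint.
            dirty = line.strip() + " nyc"
            cleanAddress(dirty)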

if __name__ == '__main__':
    main()

Solution

  • The short answer is: no, you cannot.

    Python's multiprocessing library lets you reduce the time needed for your calculations by distributing them over several processes. That can speed up the whole run of your script, but only when there is a lot of work for the CPU to do.

    In your example, most of the time goes to connecting to the web service that does the geolocation for you, so the total execution time depends on the speed of your internet connection, or the service's, rather than on your computer.
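
One way to check where the time goes is to time the two steps separately. The sketch below is an illustration only: `fake_geocode` and `clean` are hypothetical stand-ins (the first just sleeps instead of calling Bing), and the 50 ms delay is an assumed latency, not a measurement of the real service.

```python
import time


def fake_geocode(address, delay=0.05):
    """Hypothetical stand-in for geolocator.geocode: sleeps to imitate network latency."""
    time.sleep(delay)
    return address.upper()


def clean(text):
    """Hypothetical stand-in for the string cleanup; pure CPU work and very cheap."""
    return text.lower().replace('first', '1st')


def measure(addresses, delay=0.05):
    """Return (seconds spent waiting on the fake service, seconds spent cleaning)."""
    waiting = 0.0
    cleaning = 0.0
    for address in addresses:
        start = time.perf_counter()
        result = fake_geocode(address, delay)
        waiting += time.perf_counter() - start

        start = time.perf_counter()
        clean(result)
        cleaning += time.perf_counter() - start
    return waiting, cleaning


if __name__ == '__main__':
    sample = ['1 First Ave', '350 Fifth Ave', '30 Rockefeller Plaza']
    waiting, cleaning = measure(sample)
    print(f'waiting on service: {waiting:.3f}s, cleaning: {cleaning:.6f}s')
```

To measure the real script, the same two timers could be wrapped around `geolocator.geocode(dirty)` and around the string handling inside `cleanAddress`.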