I have been working on a no-SQL solution for naming a list of N postcodes using a national list of postcodes. So far I have my reference dictionary for the state of NSW in the form:
{'Belowra': 2545, 'Yambulla': 2550, 'Bingie': 2537, ... [n=4700]
My function uses this to look up the suburb name(s) for a postcode:
def look_up_sub(pc, settings):
    output = []
    for suburb, postcode in postcode_dict.items():
        if postcode == pc and settings == 'random':  # select match at random
            print(suburb)  # remove later
            output.append(suburb)
            break  # stop searching for matches
        elif postcode == pc and settings == 'all':  # print all possible names for postcode
            print(suburb)  # remove later
    return output
N = [2000, 2020, 2120, 2019]
for i in N:
    look_up_sub(i, 'random')
>>>Millers Point
>>>Mascot
>>>Westleigh
>>>Banksmeadow
While OK for small lists, this inefficient approach becomes very slow when N is sufficiently large. I have been thinking about how I could use numpy arrays to speed this up considerably, and am looking for faster ways to approach this.
Your data structure is backwards; it should map postcode to suburb. Then, when you pass it a pc, you get a list of suburbs back, and you can either select one from that list at random or print every suburb in the list.
Here is what you should do. First, reverse your dict:
from collections import defaultdict

post_to_burb = defaultdict(list)
for suburb, postcode in postcode_dict.items():
    post_to_burb[postcode].append(suburb)
Now, your function should do something like:
import random

def look_up_sub(pc, settings):
    output = []
    if settings == 'random':
        output.append(random.choice(post_to_burb[pc]))
    elif settings == 'all':
        output.extend(post_to_burb[pc])
    return output
Using numpy here would be unwieldy, especially since you are working with strings. You might get some marginal improvement in runtime, but the overall algorithm would still be linear time. With the post_to_burb dict, each lookup is constant time once the dict has been set up.
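As a rough usage sketch, assuming post_to_burb has been built from your full NSW dictionary as above, your original loop over N becomes a series of constant-time lookups (note that a postcode with no matching suburbs gives an empty list, so random.choice on it would raise an IndexError):

    N = [2000, 2020, 2120, 2019]
    for pc in N:
        # each call is a single dict lookup, regardless of how large N is
        print(look_up_sub(pc, 'random'))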