I want to replace certain words in a string with their corresponding values from a dictionary and leave the rest of the string as it is. For example, given the input:
standarisationn("well-2-34 2 @$%23beach bend com")
I want output as:
"well-2-34 2 @$%23bch bnd com"
The code I was using is:
import re

def standarisationn(addr):
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)
but it gives the wrong output:
'well - 2 - 34 2 @ $ % 23beach bnd com'
That is, it adds extra spaces and does not even convert "beach" to "bch". My idea is to first split the string by spaces, then split the resulting elements by special characters and numbers, apply the dictionary, then join the pieces that were separated by special characters without spaces, and finally join the whole list with spaces. Can anyone suggest how to go about this, or a better method?
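The split-then-rejoin idea described above could be sketched roughly as follows. This is a simplified variant, not the exact procedure from the question: instead of splitting on spaces first, it splits the string into runs of letters and runs of everything else, so the original spacing and symbols are carried through untouched. The name standardise_by_tokens and the shortened LOOKUP table are made up for illustration.

```python
import re

# Hypothetical, shortened lookup table for illustration only.
LOOKUP = {"beach": "bch", "bend": "bnd"}

def standardise_by_tokens(addr):
    # The capturing group makes re.split keep the letter runs in the result,
    # alternating with the non-letter runs (digits, spaces, symbols).
    pieces = re.split(r"([A-Za-z]+)", addr)
    # Letter runs found in the table are replaced; every other piece,
    # including the separators, is kept verbatim.
    return "".join(LOOKUP.get(piece, piece) for piece in pieces)

print(standardise_by_tokens("well-2-34 2 @$%23beach bend com"))
```

Because the pieces are joined with an empty string, no extra spaces are introduced, and "23beach" is handled because the digits and the letters end up in separate pieces.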
You can build your regular expression from the keys of your dictionary, ensuring they're not enclosed in another word (i.e. neither directly preceded nor directly followed by a letter):
import re

def standarisationn(addr):
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}
    for wrd in lookp_dict:
        addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2 @$%23beach bend com"))
The expression is built from the following pieces:
^ matches the beginning of the string
(?<=[^a-zA-Z]) is a lookbehind (i.e. a zero-width assertion that consumes no characters), checking that the preceding character is not a letter
{wrd} is the key of your dictionary
(?=[^a-zA-Z]|$) is a lookahead (also zero-width), checking that the following character is not a letter, or that the end of the string has been reached
Output:
well-2-34 2 @$%23bch bnd com
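To see what the two lookarounds do in isolation, here is a small sketch that uses the same pattern shape with the single key "bend" filled in. The helper name shorten is made up for this illustration.

```python
import re

# Same pattern shape as in the answer, with one key hard-coded.
pattern = re.compile(r"(?:^|(?<=[^a-zA-Z]))bend(?=[^a-zA-Z]|$)")

def shorten(text):
    # Replace "bend" only where it is not glued to another letter.
    return pattern.sub("bnd", text)

print(shorten("bend"))      # matched: start and end of the string
print(shorten("23bend-7"))  # matched: neighbours are a digit and a hyphen
print(shorten("bending"))   # not matched: followed by a letter
print(shorten("unbend"))    # not matched: preceded by a letter
```

Digits and punctuation count as valid neighbours here, which is why "23beach" in the original input is converted.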
Edit: you can compile a whole expression and use re.sub only once if you replace the loop with:
repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=([^a-zA-Z]|$))")
addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)
This should be much faster if your dictionary grows because we build a single expression with all your dictionary keys:
({'|'.join(lookp_dict.keys())})
is interpreted as (allee|alley|...
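Put together, the single-pass version might look like the sketch below. Two things here are additions rather than part of the code above: the pattern is compiled once at module level, and each key is passed through re.escape as a precaution in case a key ever contains a regex metacharacter (none of the current keys do). The table is shortened for brevity and the names LOOKUP, PATTERN and standardise are illustrative.

```python
import re

# Shortened table for illustration; the full dictionary works the same way.
LOOKUP = {"av": "ave", "avenue": "ave", "beach": "bch", "bend": "bnd"}

# Build one alternation from all keys, escaping each key defensively.
ALTERNATION = "|".join(re.escape(key) for key in LOOKUP)
PATTERN = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({ALTERNATION})(?=[^a-zA-Z]|$)")

def standardise(addr):
    # Normalise commas and runs of whitespace to a single space.
    addr = re.sub(r"(,|\s+)", " ", addr)
    # group(1) is the dictionary key that matched; look up its replacement.
    return PATTERN.sub(lambda m: LOOKUP[m.group(1)], addr)

print(standardise("well-2-34 2 @$%23beach bend com"))
```

A short key such as "av" listed before "avenue" does not steal the match: the lookahead rejects "av" when a letter follows, so the regex engine falls through to the longer alternative.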