I have a dataframe with a column of postal addresses (generated with geopy.geocoders GoogleV3
- I used it to parse my dataframe). The output of geolocator.geocode
, however, has the country name - which I don't want. It also contains Unit number - which I don't want.
How can I do it?
I have tried:
test_add['clean address'] = test_add.apply(lambda x: x['clean address'][:-5], axis = 1)
and
def remove_units(X):
X = X.split()
X_new = [x for x in X if not x.startswith("#")]
return ' '.join(X_new)
test_add['parsed addresses'] = test_add['clean address'].apply(remove_units)
It works for:
data = ["941 Thorpe St, Rock Springs, WY 82901, USA",
"2809 Harris Dr, Antioch, CA 94509, USA",
"7 Eucalyptus, Newport Coast, CA 92657, USA",
"725 Mountain View St, Altadena, CA 91001, USA",
"1966 Clinton Ave #234, Calexico, CA 92231, USA",
"431 6th St, West Sacramento, CA 95605, USA",
"5574 Old Goodrich Rd, Clarence, NY 14031, USA",
"Valencia Way #1234, Valley Center, CA 92082, USA"]
test_df = pd.DataFrame(data, columns=['parsed addresses'])
but get an error: "AttributeError: 'float' object has no attribute 'split'" when I use a larger dataframe with 150k such addresses.
Ultimately, I require only street number, street name, city, state and zipcode.
Another possible solution:
test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)
EXPLANATION
\D
means non-digit character.\D+
means one or more non-digit character$
means end of string|
means logical OR\s
means space character\d+
means one or more digit characterFor a more comprehensive treatment of regex, please see Regular Expression HOWTO.
Output:
0 941 Thorpe St, Rock Springs, WY 82901
1 2809 Harris Dr, Antioch, CA 94509
2 7 Eucalyptus, Newport Coast, CA 92657
3 725 Mountain View St, Altadena, CA 91001
4 1966 Clinton Ave, Calexico, CA 92231
5 431 6th St, West Sacramento, CA 95605
6 5574 Old Goodrich Rd, Clarence, NY 14031
7 Valencia Way, Valley Center, CA 92082
Name: parsed addresses, dtype: object