Search code examples
pythonpandasstringparsinggeopy

Parse DataFrame of postal addresses - remove country and unit number


I have a dataframe with a column of postal addresses (generated with geopy.geocoders GoogleV3 - I used it to parse my dataframe). The output of geolocator.geocode, however, has the country name - which I don't want. It also contains Unit number - which I don't want.

How can I do it?

I have tried:

test_add['clean address'] = test_add.apply(lambda x: x['clean address'][:-5], axis = 1)

and

def remove_units(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

test_add['parsed addresses'] = test_add['clean address'].apply(remove_units)

It works for:

data = ["941 Thorpe St, Rock Springs, WY 82901, USA",
    "2809 Harris Dr, Antioch, CA 94509, USA",
    "7 Eucalyptus, Newport Coast, CA 92657, USA",
    "725 Mountain View St, Altadena, CA 91001, USA",
    "1966 Clinton Ave #234, Calexico, CA 92231, USA",
    "431 6th St, West Sacramento, CA 95605, USA",
    "5574 Old Goodrich Rd, Clarence, NY 14031, USA",
    "Valencia Way #1234, Valley Center, CA 92082, USA"]
test_df = pd.DataFrame(data, columns=['parsed addresses'])

but get an error: "AttributeError: 'float' object has no attribute 'split'" when I use a larger dataframe with 150k such addresses.

Ultimately, I require only street number, street name, city, state and zipcode.


Solution

  • Another possible solution:

    test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)
    

    EXPLANATION

    • \D means non-digit character.
    • \D+ means one or more non-digit character
    • $ means end of string
    • | means logical OR
    • \s means space character
    • \d+ means one or more digit character

    For a more comprehensive treatment of regex, please see Regular Expression HOWTO.

    Output:

    0       941 Thorpe St, Rock Springs, WY 82901
    1           2809 Harris Dr, Antioch, CA 94509
    2       7 Eucalyptus, Newport Coast, CA 92657
    3    725 Mountain View St, Altadena, CA 91001
    4        1966 Clinton Ave, Calexico, CA 92231
    5       431 6th St, West Sacramento, CA 95605
    6    5574 Old Goodrich Rd, Clarence, NY 14031
    7       Valencia Way, Valley Center, CA 92082
    Name: parsed addresses, dtype: object