Search code examples
pythonregexpandasstreet-address

Extracting and then combining Street Address and Apartment Numbers in pandas


I'm having a really challenging time coming up with the code needed to perform the below. There's a similar question with this but I can't figure out how to adapt the code to what my needs are particularly.

I have a pandas data frame that is over 100k rows in length. Here is what the address and apartment number format looks like currently:

Current DF:

temp = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'], 'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan']}
data = pd.DataFrame(temp)
data

               col1                   col2
0      220 CENTRAL STREET, 50            50
1      165 EAST 66TH ST, RESI          RESI
2        106 SPRUCE STREET, 1           nan
3         14 EAST 67TH STREET           nan
4            1131 OGEN AVENUE           nan
5     200 EAST 1ST STREET, RU            2A
6               520 PARK LANE         DPH60
7               520 PARK LANE         DPH56
8   80 BAY STREET LANDING, 1A            1A
9       520 PARK SOUTH, DPH54         DPH54
10              520 PARK LANE         DPH52
11             62 VEST STREET           21F
12          256 FLARIN AVENUE           nan


Desired DF (data1), which adds 3 new columns to allow for different levels of granularity later on:

temp1 = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
         'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'], 
         'building_address':['220 CENTRAL STREET', '165 EAST 66TH ST', '106 SPRUCE STREET', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING', '520 PARK SOUTH', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
         'apt_no': ['50', 'RESI', '1', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
         'full_address':['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, 2A', '520 PARK LANE, DPH60', '520 PARK LANE, DPH56', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE, DPH52', '62 VEST STREET, 21F', '256 FLARIN AVENUE']}

data1 = pd.DataFrame(temp1)
data1


                col1                   col2       building_address   apt_no  \
0      220 CENTRAL STREET, 50            50     220 CENTRAL STREET     50   
1      165 EAST 66TH ST, RESI          RESI       165 EAST 66TH ST   RESI   
2        106 SPRUCE STREET, 1           nan      106 SPRUCE STREET      1   
3         14 EAST 67TH STREET           nan    14 EAST 67TH STREET    nan   
4            1131 OGEN AVENUE           nan       1131 OGEN AVENUE    nan   
5     200 EAST 1ST STREET, RU            2A    200 EAST 1ST STREET     2A   
6               520 PARK LANE         DPH60          520 PARK LANE  DPH60   
7               520 PARK LANE         DPH56          520 PARK LANE  DPH56   
8   80 BAY STREET LANDING, 1A            1A  80 BAY STREET LANDING     1A   
9       520 PARK SOUTH, DPH54         DPH54         520 PARK SOUTH  DPH54   
10              520 PARK LANE         DPH52          520 PARK LANE  DPH52   
11             62 VEST STREET           21F         62 VEST STREET    21F   
12          256 FLARIN AVENUE           nan      256 FLARIN AVENUE    nan   

                 full_address  
0      220 CENTRAL STREET, 50  
1      165 EAST 66TH ST, RESI  
2        106 SPRUCE STREET, 1  
3         14 EAST 67TH STREET  
4            1131 OGEN AVENUE  
5     200 EAST 1ST STREET, 2A  
6        520 PARK LANE, DPH60  
7        520 PARK LANE, DPH56  
8   80 BAY STREET LANDING, 1A  
9       520 PARK SOUTH, DPH54  
10       520 PARK LANE, DPH52  
11        62 VEST STREET, 21F  
12          256 FLARIN AVENUE  


In the existing DF (data), col1 is a street address that may or may not contain an apartment number. For simplicity, I'm assuming the values under col1 will have an apartment number if there's a comma. The part after the comma can be considered as the apartment number.

col2 contains the apartment number alone. It has nan's in the column. In some cases, like in row 5, the apartment number in col2 ('2A') will not match the portion after the comma in col1 ('RU'). In other cases, like in row 2, col2 may be nan but col1 has an apartment number after the comma.

What I want to do is add 3 new columns (shown in the Desired DF data1):

['building_address'] will essentially just contain everything before the comma, so it will say '220 CENTRAL STREET' whereas col1 will say '220 CENTRAL STREET, 50'

['apt_no'] will check if there is an nan. If there is, it will then check in col1 for a value after the comma. If the check is successful, it will populate that value in col2. So for example, in data1 row 2, apt_no will take the value of '1', which it got from the portion after the comma in col1. It will also check if there is a portion after the comma in col1, and if there is a value in col2, and if they are different, it will take the value in col2. For example, in row 5, apt_no has the value of '2A', taken from col2, even though col1 shows 'RU' after the comma. Finally, if there is no comma in col1 and col2 is nan, then 'apt_no' just remains nan.

['full_address'] lastly 'full address' will concatenate ['building address'] and ['apt_no'] into 1 string in the format of building address, apt_no (shown above). If 'apt_no' is nan, then the 'full address' will just be the same as 'col1'

I've been struggling with this for hours but haven't come up with a way to do it. Thanks for looking.


Solution

  • Here's code that gives your desired result. I reset the apt_no to null at the end to match your solution.

    data['building_address']=data['col1'].str.split(',').str[0]
    data['apt_no']=data['col1'].str.split(',').str[1]
    data['apt_no'][data['apt_no'].isnull()]=data['col2'][data['apt_no'].isnull()]
    data['apt_no'][(data['apt_no'].isnull()) | (data['apt_no']=='nan')]=''
    data['full_address']=(data['building_address']+', '+data['apt_no']).str.rstrip(', ')
    #Reset to null
    data['apt_no'][data['apt_no']=='']=np.nan