I'm having a really challenging time coming up with the code needed to perform the below. There's a similar question with this but I can't figure out how to adapt the code to what my needs are particularly.
I have a pandas data frame that is over 100k rows in length. Here is what the address and apartment number format looks like currently:
Current DF:
temp = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'], 'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan']}
data = pd.DataFrame(temp)
data
col1 col2
0 220 CENTRAL STREET, 50 50
1 165 EAST 66TH ST, RESI RESI
2 106 SPRUCE STREET, 1 nan
3 14 EAST 67TH STREET nan
4 1131 OGEN AVENUE nan
5 200 EAST 1ST STREET, RU 2A
6 520 PARK LANE DPH60
7 520 PARK LANE DPH56
8 80 BAY STREET LANDING, 1A 1A
9 520 PARK SOUTH, DPH54 DPH54
10 520 PARK LANE DPH52
11 62 VEST STREET 21F
12 256 FLARIN AVENUE nan
Desired DF (data1), which adds 3 new columns to allow for different levels of granularity later on:
temp1 = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'building_address':['220 CENTRAL STREET', '165 EAST 66TH ST', '106 SPRUCE STREET', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING', '520 PARK SOUTH', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'apt_no': ['50', 'RESI', '1', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'full_address':['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, 2A', '520 PARK LANE, DPH60', '520 PARK LANE, DPH56', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE, DPH52', '62 VEST STREET, 21F', '256 FLARIN AVENUE']}
data1 = pd.DataFrame(temp1)
data1
col1 col2 building_address apt_no \
0 220 CENTRAL STREET, 50 50 220 CENTRAL STREET 50
1 165 EAST 66TH ST, RESI RESI 165 EAST 66TH ST RESI
2 106 SPRUCE STREET, 1 nan 106 SPRUCE STREET 1
3 14 EAST 67TH STREET nan 14 EAST 67TH STREET nan
4 1131 OGEN AVENUE nan 1131 OGEN AVENUE nan
5 200 EAST 1ST STREET, RU 2A 200 EAST 1ST STREET 2A
6 520 PARK LANE DPH60 520 PARK LANE DPH60
7 520 PARK LANE DPH56 520 PARK LANE DPH56
8 80 BAY STREET LANDING, 1A 1A 80 BAY STREET LANDING 1A
9 520 PARK SOUTH, DPH54 DPH54 520 PARK SOUTH DPH54
10 520 PARK LANE DPH52 520 PARK LANE DPH52
11 62 VEST STREET 21F 62 VEST STREET 21F
12 256 FLARIN AVENUE nan 256 FLARIN AVENUE nan
full_address
0 220 CENTRAL STREET, 50
1 165 EAST 66TH ST, RESI
2 106 SPRUCE STREET, 1
3 14 EAST 67TH STREET
4 1131 OGEN AVENUE
5 200 EAST 1ST STREET, 2A
6 520 PARK LANE, DPH60
7 520 PARK LANE, DPH56
8 80 BAY STREET LANDING, 1A
9 520 PARK SOUTH, DPH54
10 520 PARK LANE, DPH52
11 62 VEST STREET, 21F
12 256 FLARIN AVENUE
In the existing DF (data), col1 is a street address that may or may not contain an apartment number. For simplicity, I'm assuming the values under col1 will have an apartment number if there's a comma. The part after the comma can be considered as the apartment number.
col2 contains the apartment number alone. It has nan's in the column. In some cases, like in row 5, the apartment number in col2 ('2A') will not match the portion after the comma in col1 ('RU'). In other cases, like in row 2, col2 may be nan but col1 has an apartment number after the comma.
What I want to do is add 3 new columns (shown in the Desired DF data1):
['building_address'] will essentially just contain everything before the comma, so it will say '220 CENTRAL STREET' whereas col1 will say '220 CENTRAL STREET, 50'
['apt_no'] will check if there is an nan. If there is, it will then check in col1 for a value after the comma. If the check is successful, it will populate that value in col2. So for example, in data1 row 2, apt_no will take the value of '1', which it got from the portion after the comma in col1. It will also check if there is a portion after the comma in col1, and if there is a value in col2, and if they are different, it will take the value in col2. For example, in row 5, apt_no has the value of '2A', taken from col2, even though col1 shows 'RU' after the comma. Finally, if there is no comma in col1 and col2 is nan, then 'apt_no' just remains nan.
['full_address'] lastly 'full address' will concatenate ['building address'] and ['apt_no'] into 1 string in the format of building address, apt_no (shown above). If 'apt_no' is nan, then the 'full address' will just be the same as 'col1'
I've been struggling with this for hours but haven't come up with a way to do it. Thanks for looking.
Here's code that gives your desired result. I reset the apt_no to null at the end to match your solution.
data['building_address']=data['col1'].str.split(',').str[0]
data['apt_no']=data['col1'].str.split(',').str[1]
data['apt_no'][data['apt_no'].isnull()]=data['col2'][data['apt_no'].isnull()]
data['apt_no'][(data['apt_no'].isnull()) | (data['apt_no']=='nan')]=''
data['full_address']=(data['building_address']+', '+data['apt_no']).str.rstrip(', ')
#Reset to null
data['apt_no'][data['apt_no']=='']=np.nan