I have a pandas dataset like this:
import pandas as pd
data = {'id': ['001', '002', '003','004'],
'address': ["William J. Clare\\n290 Valley Dr.\\nCasper, WY 82604\\nUSA",
"1180 Shelard Tower\\nMinneapolis, MN 55426\\nUSA",
"William N. Barnard\\n145 S. Durbin\\nCasper, WY 82601\\nUSA",
"215 S 11th ST"],
'locality': [None, None, None,'Laramie'],
'region': [None, None, None, 'WY'],
'Zipcode': [None, None, None, '87656'],
'Country': [None, None, None, 'US']
}
df = pd.DataFrame(data)
I tried to split the address column by new line but however since it has two \ followed by n. I am not able to do . Please help me in splitting the \n from address and exptrapolate into locality region zipcode and country.
sample output:
id address locality region Zipcode Country
1 290 Valley Dr. Casper WY 82604 USA
2 1180 Shelard Tower Minneapolis MN 55426 USA
3 145 S. Durbin Casper WY 82601 USA
4 215 S 11th ST Laramie WY 87656 US
I tried different methods to split \n using split command but it gives me extra \. And I am trying to keep it in pandas dataframe so that I can carry further analysis.
Here is an approach using extract
instead of split and update
in place:
df.update(df['address'].str.extract(r'([^,]+)(?:,\s(\w+)\s*(\d+)\\n(\w+))?$')
.set_axis(["address", "region", "Zipcode", "Country"], axis=1)
)
df['name'] = pd.NA
df.update(df['address'].str.extract(r'(?:(.*?)\\n)?(.*)\\n(.+)')
.set_axis(['name', 'address', 'locality'], axis=1)
)
output:
id address locality region Zipcode Country name
0 001 290 Valley Dr. Casper WY 82604 USA William J. Clare
1 002 1180 Shelard Tower Minneapolis MN 55426 USA <NA>
2 003 145 S. Durbin Casper WY 82601 USA William N. Barnard
3 004 215 S 11th ST Laramie WY 87656 US <NA>