Search code examples
pythonpandassplitdelimiter

Text to columns in pandas dataframe


I have a pandas dataset like below:

import pandas as pd

data = {'id':  ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
        }

df = pd.DataFrame(data)

print(df)

I need to convert address column to text delimited by \n and create new columns like name, address line 1, City, State, Zipcode, Country like below:

id  Name   addressline1 City    State   Zipcode Country
1   William J. Clare    290 Valley Dr.  Casper  WY  82604   United States
2   null    1180 Shelard Tower  Minneapolis MN  55426   United States
3   William N. Barnard  145 S. Durbin   Casper  WY  82601   United States

I am learning python and from morning I am solving this. Any help will be greatly appreciated.

Thanks,


Solution

  • Right now, Pandas is returning you the table with 2 columns. If you look at the value in the second column, the essential information is separated with the comma. Therefore, if you saved your dataframe to df you can do the following:

    df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
    df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
    df['country'] = df['address'].apply(lambda x: x.split(',')[2])
    

    Now, you have additional three columns in your dataframe, the last one contains the full information about the country already. Now from the first two columns that you have created you can extract the info you need in a similar way.

    df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
    df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
    df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
    df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
    

    Now you should have all the columns you need. You can remove the excess columns with:

    df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
    

    Of course, it all can be done faster and with fewer lines of code, but I think it is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for split and join methods, and also for apply method, native to pandas.