python pandas data-preprocessing

How to split string values in a column


I have a dataset with a column ('facilities') of object dtype. It has several missing values, and its string values contain no spaces between the words, as shown below:

attachment

How can I add spaces between the words? I have tried the code below, but it doesn't work:

X['Restaurant'] = X['facilities'].apply(lambda x: 1 if 'Restaurant' in x else 0)
X['BAR'] = X['facilities'].apply(lambda x: 1 if 'BAR' in x else 0)
X['SwimmingPools'] = X['facilities'].apply(lambda x: 1 if 'SwimmingPools' in x else 0)
df3 = X['facilities'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]

Solution

  • You can use re.split to split each string on the known words, then use " ".join to rejoin the pieces with a single space as the separator:

    import pandas as pd
    import re
    
    df = pd.DataFrame({"facilities":["GymrestaurantbarInternetSwimmingPools",
                                    "Poolrestaurantgyminternetbar",
                                    "BARswimmingPoolsInternetgym"]})
    
    #                               facilities
    # 0  GymrestaurantbarInternetSwimmingPools
    # 1           Poolrestaurantgyminternetbar
    # 2            BARswimmingPoolsInternetgym
    
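    # Add every word you want to separate on to this alternation.
    # The capturing group makes re.split keep the matched words in its result.
    pattern = '(gym|restaurant|internet|swimmingpools|bar)'
    
    # Lowercase each value, split on the known words, drop empty pieces, rejoin with spaces.
    df["facilities_cleaned"] = df["facilities"].apply(
        lambda s: " ".join(word for word in re.split(pattern, s.lower()) if word)
    )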
    
    #                               facilities                         facilities_cleaned
    # 0  GymrestaurantbarInternetSwimmingPools  gym restaurant bar internet swimmingpools
    # 1           Poolrestaurantgyminternetbar           pool restaurant gym internet bar
    # 2            BARswimmingPoolsInternetgym             bar swimmingpools internet gym
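
  • The question mentions missing values, and calling .lower() on a NaN raises an AttributeError, so the lambda above would fail on those rows. A minimal sketch of a NaN-safe variant follows. The helper name add_spaces, the KEYWORDS list, and the has_bar column are illustrative assumptions, not part of the original data:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical keyword list; extend it with every facility name in your data.
KEYWORDS = ["swimmingpools", "restaurant", "internet", "gym", "bar"]
PATTERN = re.compile("(" + "|".join(map(re.escape, KEYWORDS)) + ")")


def add_spaces(value):
    """Lowercase the value and put spaces around each keyword; pass missing values through."""
    if pd.isna(value):
        return value
    parts = PATTERN.split(str(value).lower())
    return " ".join(part for part in parts if part)


df = pd.DataFrame({"facilities": ["GymrestaurantbarInternetSwimmingPools",
                                  np.nan,
                                  "BARswimmingPoolsInternetgym"]})

df["facilities_cleaned"] = df["facilities"].apply(add_spaces)

# A case-insensitive 0/1 indicator column; na=False treats missing values as "not present".
df["has_bar"] = df["facilities"].str.contains("bar", case=False, na=False).astype(int)
```

    The same str.contains call, with case=False and na=False, is one way to build the 0/1 columns attempted in the question without tripping over missing values or mixed capitalisation.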