python regex pandas dataframe trim

Strip / trim all strings of a dataframe


While cleaning the values of a mixed-type DataFrame in Python/pandas, I want to trim the strings. I am currently doing it in two statements:

import pandas as pd

df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])

df.replace(r'^\s+', '', regex=True, inplace=True)  # leading whitespace
df.replace(r'\s+$', '', regex=True, inplace=True)  # trailing whitespace

df.values

This is quite slow. What could I improve?


Solution

  • You can use DataFrame.select_dtypes to select string columns and then apply function str.strip.

    Notice: This assumes the selected columns hold only strings. Columns whose values are dicts or lists are selected too, because their dtype is also object.

    df_obj = df.select_dtypes('object')
    # to also process categorical columns that hold strings:
    # df_obj = df.select_dtypes(['object', 'category'])
    print (df_obj)
    0    a  
    1    c  
    
    df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
    print (df)
    
       0   1
    0  a  10
    1  c   5
    

    But if there are only a few columns, call str.strip on each one directly:

    df[0] = df[0].str.strip()
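
If some object columns do mix strings with other values (the case the notice above warns about), one possible workaround is to strip element by element and leave non-string values alone. The following is only a minimal sketch: the sample data, the column names and the helper `strip_if_str` are made up for illustration, and `Series.map` with a Python function is likely slower than the vectorized `.str.strip()`, so it is probably only worth using when mixed values are actually present.

```python
import pandas as pd

# Hypothetical sample data; dtype=object is set explicitly on the text columns
# so the example does not depend on how the installed pandas infers string dtypes.
df = pd.DataFrame({
    "mixed": pd.Series(["  a  ", ["x", " y "], None], dtype=object),
    "text": pd.Series(["  c  ", " d", "e "], dtype=object),
    "n": [10, 5, 7],
})


def strip_if_str(value):
    """Return value.strip() for str inputs and any other value unchanged."""
    return value.strip() if isinstance(value, str) else value


# Select the object-dtype columns as in the answer, then strip element-wise.
obj_cols = df.select_dtypes("object").columns
df[obj_cols] = df[obj_cols].apply(lambda col: col.map(strip_if_str))

print(df)
```

Here the list in the `mixed` column is passed through untouched, including the whitespace inside its elements, and the numeric column `n` is never visited because `select_dtypes` does not select it.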