I am trying to loop through a column in a pandas data frame to remove unnecessary white space at the beginning and end of the strings in that column. My data frame looks like this:
df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }
c1 b2
0 ab, fg
1 ac, hj-jk
2 ac, df,gh
3 gh, be
4 ab, be
5 ad, jk-pl
6 ae, kl-kl
I tried this answer here, but it did not work either. The reason I need to remove the white space from the strings in this column is that I want to one-hot encode the column using get_dummies(). My idea was to use strip() to remove the white space from each value and then call .str.get_dummies(','):
#function to remove white space from strings
def strip_string(dataframe, column_name):
    for id, item in dataframe[column_name].items():
        a=item.strip()
#removing the white space from the values of the column
strip_string(df, 'c1')
#creating one hot-encoded columns from the values using split(",")
df1=df['c1'].str.get_dummies(',')
But my code returns duplicate columns, which I don't want. I suppose the function to remove the white space is not working? Can anyone help? My current output is:
ab ac df fg gh hj-jk jk-pl kl-kl ab ac ad ae gh
0 1 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 1 0 0 0
2 0 1 1 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 1 0 0
6 0 0 0 0 0 0 0 1 0 0 0 1 0
Columns 'ac' and 'ab' are duplicated. I want to remove the duplicated columns.
I would stack, strip, get_dummies, and groupby.max:
If the separator is ', ':
df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()
Otherwise:
df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()
output:
ab ac ba bc bd be df fg gh hj-jk
0 1 0 1 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 1 0 0
2 0 1 0 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 1
4 0 1 0 0 0 1 0 0 0 0
5 0 0 0 0 0 1 1 0 1 0
6 0 0 1 0 0 0 0 0 1 0
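As a side note on why the loop in the question seems to do nothing: `item.strip()` returns a new string, and that result is assigned to a local variable and then discarded, so the column is never updated. Below is a minimal sketch of a vectorized alternative restricted to column `c1` (the `stack()` version above also pulls in `b2`). The sample data is an assumption made for illustration, since the dict in the question has lists of unequal length, and the names `tokens` and `dummies` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data (assumption): comma-separated values with stray
# white space before and after the individual tokens.
df = pd.DataFrame({"c1": [" ab, fg", "ac , hj-jk ", " ac, df,gh"]})

# Vectorized string methods return a new Series, so nothing is discarded
# the way the result of item.strip() was inside the loop.
tokens = (
    df["c1"]
    .str.split(",")   # list of raw tokens per row
    .explode()        # one token per row, original index repeated
    .str.strip()      # remove leading/trailing white space from each token
)

# One indicator column per distinct token, then collapse back to one row
# per original index with groupby(level=0).max().
dummies = pd.get_dummies(tokens, dtype=int).groupby(level=0).max()
print(dummies)
```

Because each token is stripped before encoding, ' ac' and 'ac ' end up in the same column, which should avoid the duplicated 'ab' and 'ac' columns.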