I'm stripping punctuation from strings contained within a Pandas dataframe. For example:
import pandas as pd
df = pd.DataFrame(data = [['a.b', 'c_d', 'e^f'],['g*h', 'i@j', 'k&l']],
columns = ['column 1', 'column 2', 'column 3'])
I've succeeded in stripping punctuation within a single column using a list comprehension:
import string
df_nopunct = [line.translate(str.maketrans('', '', string.punctuation))
for line in list(df['column 1'])]
# ['ab', 'gh']
But what I'd really like to do is strip punctuation across the entire dataframe, saving this as a new dataframe.
If I try the same approach on the entire dataframe, it seems to just return a list of my column names:
df_nopunct = [line.translate(str.maketrans('', '', string.punctuation))
for line in list(df)]
# ['column 1', 'column 2', 'column 3']
Should I iterate line.translate(str.maketrans('', '', string.punctuation)) across columns, or is there a simpler way to accomplish this?
I've looked at the detailed answer about how to strip punctuation, but it deals with stripping from a single string rather than across an entire dataframe.
You could use df.replace directly with a regular expression, as follows:
import string
df_trans = df.replace('['+string.punctuation+']', '', regex=True)
Out[766]:
  column 1 column 2 column 3
0       ab       cd       ef
1       gh       ij       kl
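The character class above works because of where the special characters happen to sit inside string.punctuation (for example, `\` directly precedes `]`, and `-` falls inside the range `,-.`). If you would rather not rely on that ordering, a minimal sketch is to escape the characters with re.escape first. The names pattern and df_escaped are only illustrative:

```python
import re
import string

import pandas as pd

df = pd.DataFrame(
    data=[['a.b', 'c_d', 'e^f'], ['g*h', 'i@j', 'k&l']],
    columns=['column 1', 'column 2', 'column 3'],
)

# re.escape backslash-escapes the regex metacharacters in string.punctuation,
# so the character class no longer depends on the order of the characters.
pattern = '[' + re.escape(string.punctuation) + ']'

# Same call as above, only with the escaped pattern.
df_escaped = df.replace(pattern, '', regex=True)
```

For this sample data the result should match df_trans; the escaping matters mainly if you build the class from your own set of characters.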
If you prefer translate, use a dict comprehension that applies str.translate to each column, and construct a new dataframe from the result:
import string
trans = str.maketrans('', '', string.punctuation)
df_trans = pd.DataFrame({col: df[col].str.translate(trans) for col in df})
Out[746]:
  column 1 column 2 column 3
0       ab       cd       ef
1       gh       ij       kl
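One caveat with the translate version: the .str accessor raises an AttributeError on non-string columns, so it only works as written when every column holds strings. If your real dataframe mixes text and numbers, a possible sketch is to translate only the string columns and pass the rest through unchanged. The frame df_mixed and its column names are hypothetical, made up for illustration:

```python
import string

import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df_mixed = pd.DataFrame({'text': ['a.b', 'g*h'], 'count': [1, 2]})

trans = str.maketrans('', '', string.punctuation)

# Apply str.translate only where the column holds strings;
# every other column is copied over as-is.
df_mixed_trans = pd.DataFrame({
    col: df_mixed[col].str.translate(trans)
    if pd.api.types.is_string_dtype(df_mixed[col])
    else df_mixed[col]
    for col in df_mixed
})
```

Here the 'text' column is stripped of punctuation while 'count' keeps its integer values.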