I have a dataframe like this
column_name
0 OnePlus phones never fail to meet my expectatiion.
1 received earlier than expected for local set.
2 \n
3 good
4 must buy!
5 \t
6
7 awesome product!
8 \n
I want to remove all rows which contain ONLY whitespace, such as '\n', '\t', ' ', or ' \n ', in them.
Output should be like this:
column_name
0 OnePlus phones never fail to meet my expectatiion.
1 received earlier than expected for local set.
2 good
3 must buy!
4 awesome product!
I tried the following method:
df = df[df.column_name != '\n'].reset_index(drop=True)
df = df[df.column_name != ''].reset_index(drop=True)
df = df[df.column_name != ' '].reset_index(drop=True)
df = df[df.column_name != ' '].reset_index(drop=True)
df = df[df.column_name != ' \n '].reset_index(drop=True)
But is there a more elegant way or a pythonic way to do this instead of repeating the code?
You can use Series.str.strip
and compare the result with an empty string:
df1 = df[df.column_name.str.strip() != ''].reset_index(drop=True)
Or convert empty values to boolean:
df1 = df[df.column_name.str.strip().astype(bool)].reset_index(drop=True)
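Here is a runnable sketch of the boolean approach, with the sample DataFrame reconstructed from the question; rows that strip down to an empty string are falsy, so astype(bool) drops them:

```python
import pandas as pd

# Sample data mirroring the question: some rows hold only whitespace.
df = pd.DataFrame({"column_name": [
    "OnePlus phones never fail to meet my expectatiion.",
    "received earlier than expected for local set.",
    "\n", "good", "must buy!", "\t", "", "awesome product!", "\n",
]})

# Strip whitespace; empty strings are falsy, so astype(bool)
# keeps only rows with real content.
df1 = df[df.column_name.str.strip().astype(bool)].reset_index(drop=True)
print(df1.column_name.tolist())
# → ['OnePlus phones never fail to meet my expectatiion.',
#    'received earlier than expected for local set.',
#    'good', 'must buy!', 'awesome product!']
```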
Or keep only rows that contain at least one word character; here the strip
was necessary first (with real data the strip
may not be needed). Note the regex r'\w' matches word characters:
df1 = df[df.column_name.str.strip().str.contains(r'\w', na=False)].reset_index(drop=True)
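A small runnable check of the word-character filter (the sample values below are made up for illustration); na=False means missing values are treated as non-matches, so they are dropped as well:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_name": ["good", "\n", np.nan, " \t "]})

# Strip whitespace, then keep only rows containing a word character.
# na=False makes NaN rows evaluate to False instead of NaN.
df1 = df[df.column_name.str.strip().str.contains(r'\w', na=False)].reset_index(drop=True)
print(df1.column_name.tolist())  # → ['good']
```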
If you also need to remove missing values and non-string values, replace the whitespace-only strings with NaN
s and then use DataFrame.dropna
:
import numpy as np

df.column_name = df.column_name.replace(r'^\s*$', np.nan, regex=True)
df1 = df.dropna(subset=['column_name']).reset_index(drop=True)
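As a self-contained sketch of the NaN route (sample values invented for illustration), this removes whitespace-only rows and genuinely missing values in one dropna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_name": [
    "must buy!", "\n", "", " \t ", None, "awesome product!",
]})

# Whitespace-only (or empty) strings become NaN; dropna then removes
# them together with the row that was already missing.
df.column_name = df.column_name.replace(r'^\s*$', np.nan, regex=True)
df1 = df.dropna(subset=['column_name']).reset_index(drop=True)
print(df1.column_name.tolist())  # → ['must buy!', 'awesome product!']
```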