I have been trying to clean a data column by removing part of the text from each value, but I cannot get my head around it.
I tried the .str.replace method on the pandas Series, but it did not seem to work:
df['Salary Estimate'].str.replace(' (Glassdoor est.)', '',regex=True)
0 $53K-$91K (Glassdoor est.)
1 $63K-$112K (Glassdoor est.)
2 $80K-$90K (Glassdoor est.)
3 $56K-$97K (Glassdoor est.)
4 $86K-$143K (Glassdoor est.)
...
922 -1
925 -1
928 $59K-$125K (Glassdoor est.)
945 $80K-$142K (Glassdoor est.)
948 $62K-$113K (Glassdoor est.)
Name: Salary Estimate, Length: 600, dtype: object
What I expected was
0 $53K-$91K
1 $63K-$112K
2 $80K-$90K
3 $56K-$97K
4 $86K-$143K
...
922 -1
925 -1
928 $59K-$125K
945 $80K-$142K
948 $62K-$113K
Name: Salary Estimate, Length: 600, dtype: object
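With regex=True, the pattern is interpreted as a regular expression, so you have to escape regex metacharacters such as (, ) and . to match them literally. Unescaped, the parentheses form a group, and the pattern looks for a space followed directly by "Glassdoor", which never occurs in your data, so nothing is replaced. You can escape the pattern with re.escape: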
>>> import re
>>> df['Salary Estimate'].str.replace(re.escape(r' (Glassdoor est.)'), '',regex=True)
0 $53K-$91K
1 $63K-$112K
2 $80K-$90K
3 $56K-$97K
4 $86K-$143K
Name: Salary Estimate, dtype: object
# Or escape the metacharacters by hand, without importing the re module
>>> df['Salary Estimate'].str.replace(r' \(Glassdoor est\.\)', '',regex=True)
0 $53K-$91K
1 $63K-$112K
2 $80K-$90K
3 $56K-$97K
4 $86K-$143K
Name: Salary Estimate, dtype: object
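Since the text being removed is a fixed string, another option is to turn regex matching off so that nothing needs escaping. The following is a minimal sketch on a small made-up Series (the sample values are assumptions copied from the shape of the data in the question, not the real DataFrame):

```python
import pandas as pd

# Hypothetical sample that mimics the 'Salary Estimate' column from the question.
s = pd.Series(
    ["$53K-$91K (Glassdoor est.)", "$63K-$112K (Glassdoor est.)", "-1"],
    name="Salary Estimate",
)

# regex=False treats the pattern as a literal string, so '(', ')' and '.'
# are matched as ordinary characters.
cleaned = s.str.replace(" (Glassdoor est.)", "", regex=False)
print(cleaned.tolist())
```

Note that str.replace returns a new Series, so the result has to be assigned back (for example to df['Salary Estimate']) for the change to persist.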
You can also extract the lower and upper bounds of each range into named columns:
>>> df['Salary Estimate'].str.extract(r'\$(?P<min>\d+)K-\$(?P<max>\d+)K')
min max
0 53 91
1 63 112
2 80 90
3 56 97
4 86 143
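The extracted values are strings, and rows that do not match the pattern (such as the -1 entries) come back as NaN. If you want to do arithmetic on them, one possible follow-up is to convert the columns to floats. This is only a sketch on made-up sample data; the `mid` column is a hypothetical addition for illustration:

```python
import pandas as pd

# Hypothetical sample, including a '-1' row that does not match the pattern.
s = pd.Series(
    ["$53K-$91K (Glassdoor est.)", "-1", "$80K-$142K (Glassdoor est.)"],
    name="Salary Estimate",
)

# Named groups become column names; non-matching rows become NaN.
bounds = s.str.extract(r"\$(?P<min>\d+)K-\$(?P<max>\d+)K").astype(float)

# Example use: midpoint of each range, in thousands of dollars.
bounds["mid"] = (bounds["min"] + bounds["max"]) / 2
print(bounds)
```

Float is used here rather than int because NaN cannot be stored in a plain integer column.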