Tags: python, pandas, dataframe, data-science

Replace a part of string in Pandas Data Column, replace doesn't work


I have been trying to clean a data column by removing part of the text from each value, but I cannot get it to work.

I tried the .str.replace method on the pandas Series, but the values come back unchanged:

df['Salary Estimate'].str.replace(' (Glassdoor est.)', '',regex=True)


0       $53K-$91K (Glassdoor est.)
1      $63K-$112K (Glassdoor est.)
2       $80K-$90K (Glassdoor est.)
3       $56K-$97K (Glassdoor est.)
4      $86K-$143K (Glassdoor est.)
                  ...             
922                             -1
925                             -1
928    $59K-$125K (Glassdoor est.)
945    $80K-$142K (Glassdoor est.)
948    $62K-$113K (Glassdoor est.)
Name: Salary Estimate, Length: 600, dtype: object

What I expected was



0       $53K-$91K
1      $63K-$112K
2       $80K-$90K
3       $56K-$97K
4      $86K-$143K
          ...
922            -1
925            -1
928    $59K-$125K
945    $80K-$142K
948    $62K-$113K
Name: Salary Estimate, Length: 600, dtype: object


Solution

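  • With regex=True the pattern is interpreted as a regular expression, so metacharacters such as (, ) and . must be escaped. Unescaped, the parentheses form a capture group instead of matching literal parentheses, so the pattern never matches the text in the column: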

    >>> import re
    >>> df['Salary Estimate'].str.replace(re.escape(r' (Glassdoor est.)'), '', regex=True)
    0     $53K-$91K
    1    $63K-$112K
    2     $80K-$90K
    3     $56K-$97K
    4    $86K-$143K
    Name: Salary Estimate, dtype: object
    
    # Or, without importing the re module, escape the metacharacters by hand
    >>> df['Salary Estimate'].str.replace(r' \(Glassdoor est\.\)', '', regex=True)
    0     $53K-$91K
    1    $63K-$112K
    2     $80K-$90K
    3     $56K-$97K
    4    $86K-$143K
    Name: Salary Estimate, dtype: object
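
If the text to remove is a fixed string, a literal (non-regex) replacement may be simpler, since nothing needs escaping. The sketch below assumes a small stand-in DataFrame with the same column name as in the question; the sample rows are illustrative only. Note that str.replace returns a new Series, so the result has to be assigned back to the column to persist.

```python
import pandas as pd

# Stand-in data (assumption): a few rows shaped like the question's column.
df = pd.DataFrame(
    {
        "Salary Estimate": [
            "$53K-$91K (Glassdoor est.)",
            "$63K-$112K (Glassdoor est.)",
            "-1",
        ]
    }
)

# regex=False treats the pattern as plain text, so ( ) and . are matched literally.
# The result is assigned back because str.replace does not modify the column in place.
df["Salary Estimate"] = df["Salary Estimate"].str.replace(
    " (Glassdoor est.)", "", regex=False
)

print(df["Salary Estimate"].tolist())
```

Rows that do not contain the suffix, such as the -1 placeholders, are left untouched.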
    

    You can also extract the lower and upper bounds into separate columns using named groups (the extracted values are strings):

    >>> df['Salary Estimate'].str.extract(r'\$(?P<min>\d+)K-\$(?P<max>\d+)K')
      min  max
    0  53   91
    1  63  112
    2  80   90
    3  56   97
    4  86  143