Search code examples
pythonpandasvisual-studio-codedata-sciencedata-cleaning

Index issues while data cleaning


Original code credit-Ken Jee

Salary = df['Salary Estimate'].apply(lambda x:x.split('(')[0])
minus_Kd = Salary.apply(lambda x:x.replace('K','').replace('$',''))

min_hr=minus_Kd.apply(lambda x:x.lower().replace('per hour;','').replace('employer 
provided salary:',''))

df['min_salary'] = min_hr.apply(lambda x: x.split('-')[0])
df['max_salary'] = min_hr.apply(lambda x: x.split('-')[1])
df['avg_salary']=(df.min_salary+df.max_salary)/2

Having the following error when running this portion of my code:

  df['max_salary'] = min_hr.apply(lambda x: x.split('-')[1])
  IndexError: list index out of range

Here is some sample data that I am attempting to parse through:

Employer Provided Salary:$78K - $191K

$77K - $107K (Glassdoor est.)

Being a zero index, shouldn`t 0 and 1 be left/right of the "-" respectively?


Solution

  • Using .str accessor with extract, regex and eval:

    Given df,

    df = pd.DataFrame({'Job':['Job 1', 'Job 2', 'Job 3'],
                       'Salary':['$78K - $191K', '$77K - $107K', '$100K']})
    

    Input df:

         Job        Salary
    0  Job 1  $78K - $191K
    1  Job 2  $77K - $107K
    2  Job 3         $100K
    

    Option 1

    Using regex

    df['Salary'].str.extract('\$(?P<lower>\d+)(?:.*\$(?P<higher>\d+)K)?').astype('float').eval('average = (lower+higher)/2')
    

    Output:

       lower  higher  average
    0   78.0   191.0    134.5
    1   77.0   107.0     92.0
    2  100.0     NaN      NaN
    

    Option 2:

    Using list comprehension:

    df['Min Salary'] = [int(x.split('-')[0].strip().strip('\$|K')) for x in df['Salary']]
    df['Max Salary'] = [int(x.split('-')[-1].strip().strip('\$|K')) for x in df['Salary']]
    df['Avg Salary'] = (df['Min Salary'] + df['Max Salary'])/2
    

    Output:

         Job        Salary  Min Salary  Max Salary  Avg Salary
    0  Job 1  $78K - $191K          78         191       134.5
    1  Job 2  $77K - $107K          77         107        92.0
    2  Job 3         $100K         100         100       100.0
    

    Which leads me to the -1 index trick that way you don't have to worry about night having a second value if there isn't '-'.