Search code examples
pythonpandasnumpyfeature-extractionfeature-engineering

Conversion of years and months to months with string column input python


dataset example:

experience

5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year

I have a column with applicant's experience in years. It is very messy and I tried to go through it and create a sample. I have numbers followed by (month or months) and (year of years).

There are many nan entries and it should be ignored.

The goal is to create a column experience in months:

if nan
  copy nan to the corresponding column
if the row has month or months 
  copy the number to the corresponding column
if year or years in the row and the number <55 
  the number shall be multiplied by 12 and copied to the corresponding column
else copy nan to the corresponding column

How to achieve this?


Solution

  • my_dict = {'Experience': ['5 month', 'nan', '1 months', '8 month','17 months','8 year',
                              '11 years','1.7 year', '3.1 years', '15.7 months','18 year',
                              '2017.2 years', '98.3 years', '68 year']}
    
    df = pd.DataFrame(my_dict)
    
    # Create filter for month/months
    month_filt = df['Experience'].str.contains('month')
    
    # Filter DataFrame for rows that contain month/months 
    df['Months'] = df.loc[month_filt, 'Experience'].str.strip('month|months')
    
    # Create filter for year/years
    year_filt = df['Experience'].str.contains('year')
    
    # Filter DataFrame for rows that contain year/years
    df['Years'] = df.loc[year_filt, 'Experience'].str.strip('year|years')
    
    # Fill NaN in Years column
    df.loc[df['Years'].isna(),'Years'] = np.nan
    
    # Convert Years to months
    df.loc[df['Months'].isna(),'Months'] = df['Years'].astype('float') * 12
    
    # Set years greater than 55 to NaN
    df.loc[df['Years'].astype('float') > 55, 'Months'] = np.nan
    
        Experience  Months  Years
    0   5 month     5       NaN
    1   nan         NaN     NaN
    2   1 months    1       NaN
    3   8 month     8       NaN
    4   17 months   17      NaN
    5   8 year      96      8
    6   11 years    132     11
    7   1.7 year    20.4    1.7
    8   3.1 years   37.2    3.1
    9   15.7 months 15.7    NaN
    10  18 year     216     18
    11  2017.2 yearsNaN 2017.2
    12  98.3 years  NaN     98.3
    13  68 year     NaN     68