python-3.x pandas data-preprocessing

Why do I get different output when I swap the order of the 'years' and 'year' replacements in my code?


All I did was swap the 'year' and 'years' replacements, moving each from the first line to the second and vice versa.

Here is the original column:

10+ years    653
< 1 year     249
2 years      243
3 years      235
5 years      202
4 years      191
1 year       177
6 years      163
7 years      127
8 years      108
9 years       72
.              2
Name: Employment.Length, dtype: int64

First example ('years' on the first line, 'year' on the second line):

raw_data['Employment.Length'] = raw_data['Employment.Length'].str.replace('years',' ')
raw_data['Employment.Length'] = raw_data['Employment.Length'].str.replace('year',' ')
raw_data['Employment.Length'] = np.where(raw_data['Employment.Length'].str[:2]=='10',10,raw_data['Employment.Length'])
raw_data['Employment.Length'] = np.where(raw_data['Employment.Length'].str[0]=='<',0,raw_data['Employment.Length'])
raw_data['Employment.Length'] = pd.to_numeric(raw_data['Employment.Length'], errors = 'coerce')

Output:

10.0    653
0.0     249
2.0     243
3.0     235
5.0     202
4.0     191
1.0     177
6.0     163
7.0     127
8.0     108
9.0      72
Name: Employment.Length, dtype: int64

Second example ('year' on the first line, 'years' on the second line):

raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('year',' ')
raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[:2]=='10',10, raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[0]=='<',0,raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = pd.to_numeric(raw_data_copy['Employment.Length'], errors = 'coerce')

Output:

10.0    653
0.0     249
1.0     177
Name: Employment.Length, dtype: int64

One more thing: when I comment out the second line with 'year' in it, I get almost the same output as in the first example (only the 1.0 row is missing). When I comment out the second line with 'years' in it, I get the same output as in the second example.

Third example (only 'years' is replaced, the 'year' line is commented out):

raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
#raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('year',' ')
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[:2]=='10',10, raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[0]=='<',0,raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = pd.to_numeric(raw_data_copy['Employment.Length'], errors = 'coerce')

Output:

10.0    653
0.0     249
2.0     243
3.0     235
5.0     202
4.0     191
6.0     163
7.0     127
8.0     108
9.0      72
Name: Employment.Length, dtype: int64

Solution

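  • If you replace 'year' with ' ' first, every 'years' becomes ' s', so the subsequent str.replace('years', ' ') finds nothing left to match and the stray 's' stays in the string. A value such as '2  s' cannot be parsed by pd.to_numeric, so errors='coerce' turns it into NaN and it drops out of the counts. Only the singular entries ('1 year' and '< 1 year') and the '10+' entries, which np.where overwrites, survive.

    Instead of several subsequent replacements, use a single regular-expression replacement with an optional s: 'year[s]?'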

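    import pandas as pd
    s = pd.Series(['year', 'years', 'foo'])
    
    # regex=True is required on pandas >= 2.0, where str.replace matches literally by default
    s.str.replace('year[s]?', ' ', regex=True)
    #0       
    #1       
    #2    foo
    #dtype: object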
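
To make the ordering problem concrete, here is a minimal sketch on a small hypothetical sample that mirrors the values in the question's column (the names s, year_first, cleaned and numeric are made up for illustration, not taken from the question). It first shows the leftover 's', then one possible way to do the whole cleanup with a single pattern. The explicit mapping for '10+' and '< 1' is an assumption about how you want those two categories encoded, based on the np.where lines in the question.

```python
import pandas as pd

# Hypothetical sample mirroring the values in the question's column.
s = pd.Series(['10+ years', '< 1 year', '2 years', '1 year', '.'])

# Replacing 'year' first leaves a stray 's' behind in every plural entry,
# e.g. '2 years' becomes '2  s', which pd.to_numeric cannot parse.
year_first = s.str.replace('year', ' ', regex=False)

# One pattern with an optional 's' handles both forms in a single pass.
cleaned = s.str.replace(r'years?', '', regex=True).str.strip()

# Map the two special categories by hand (assumed encoding: 10 and 0),
# then coerce everything that is still not a number to NaN.
cleaned = cleaned.replace({'10+': '10', '< 1': '0'})
numeric = pd.to_numeric(cleaned, errors='coerce')

print(year_first.tolist())
print(numeric.tolist())
```

With this approach the order of the replacements no longer matters, because there is only one. The '.' entry still ends up as NaN, as it does in the question's first example.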