All I did was swap the positions of 'year' and 'years', moving each from the first line to the second line and vice versa.
Here is the original column:
10+ years 653
< 1 year 249
2 years 243
3 years 235
5 years 202
4 years 191
1 year 177
6 years 163
7 years 127
8 years 108
9 years 72
. 2
Name: Employment.Length, dtype: int64
First example ('years' on the first line, 'year' on the second line):
raw_data['Employment.Length'] = raw_data['Employment.Length'].str.replace('years',' ')
raw_data['Employment.Length'] = raw_data['Employment.Length'].str.replace('year',' ')
raw_data['Employment.Length'] = np.where(raw_data['Employment.Length'].str[:2]=='10',10,raw_data['Employment.Length'])
raw_data['Employment.Length'] = np.where(raw_data['Employment.Length'].str[0]=='<',0,raw_data['Employment.Length'])
raw_data['Employment.Length'] = pd.to_numeric(raw_data['Employment.Length'], errors = 'coerce')
output
10.0 653
0.0 249
2.0 243
3.0 235
5.0 202
4.0 191
1.0 177
6.0 163
7.0 127
8.0 108
9.0 72
Name: Employment.Length, dtype: int64
Second example ('year' on the first line, 'years' on the second line):
raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('year',' ')
raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[:2]=='10',10, raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[0]=='<',0,raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = pd.to_numeric(raw_data_copy['Employment.Length'], errors = 'coerce')
output
10.0 653
0.0 249
1.0 177
Name: Employment.Length, dtype: int64
One more thing: when I comment out the second line with 'year' in it, I get the same output as the first example, and when I comment out the second line with 'years' in it, I get the same output as the second example.
Third example:
raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
#raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('year',' ')
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[:2]=='10',10, raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[0]=='<',0,raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = pd.to_numeric(raw_data_copy['Employment.Length'], errors = 'coerce')
output
10.0 653
0.0 249
2.0 243
3.0 235
5.0 202
4.0 191
6.0 163
7.0 127
8.0 108
9.0 72
Name: Employment.Length, dtype: int64
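If you first replace 'year' with ' ', then 'years' becomes ' s', and the leftover 's' is no longer matched by your subsequent str.replace('years', ' '). pd.to_numeric with errors='coerce' then turns those values into NaN, which is why they disappear from the counts.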
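To make the ordering effect visible, here is a minimal sketch on a small hypothetical Series (the three sample values are chosen for illustration and mimic the format of your column). It uses regex=False so both replacements are plain literal substitutions:

```python
import pandas as pd

# Hypothetical sample values in the same format as Employment.Length
s = pd.Series(['1 year', '2 years', '10+ years'])

# 'year' first: this also consumes the 'year' inside 'years', leaving a stray 's'
year_first = s.str.replace('year', ' ', regex=False)
# the later 'years' replacement now has nothing left to match
year_first = year_first.str.replace('years', ' ', regex=False)

# 'years' first: the longer string is removed before the shorter one
years_first = (
    s.str.replace('years', ' ', regex=False)
     .str.replace('year', ' ', regex=False)
)

print(year_first.tolist())   # values such as '2  s' keep the stray 's'
print(years_first.tolist())  # only digits, '+' and whitespace remain
```

Only the entries that originally contained 'years' are affected, which matches the rows missing from your second output.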
Instead of several successive replacements, use a single one with an optional s: 'year[s]?'
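import pandas as pd
s = pd.Series(['year', 'years', 'foo'])
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
s.str.replace('year[s]?', ' ', regex=True)
#0
#1
#2 foo
#dtype: object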
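Applied to the whole column, the same idea could look like the sketch below. The sample Series is hypothetical and only mirrors the value formats shown in the question; mapping '< 1' to 0 and '10+' to 10 follows what the question's np.where lines do, and the pattern r'\s*years?' is just one way to write the optional s.

```python
import pandas as pd

# Hypothetical sample with the same value formats as Employment.Length
col = pd.Series(['10+ years', '< 1 year', '2 years', '1 year', '.'])

# Remove 'year' or 'years' (and the whitespace before it) in a single pass
cleaned = col.str.replace(r'\s*years?', '', regex=True).str.strip()

# Map the two special cases explicitly, as the question does with np.where
cleaned = cleaned.replace({'10+': '10', '< 1': '0'})

# Anything that is still not a number (here the '.') becomes NaN
result = pd.to_numeric(cleaned, errors='coerce')
print(result)
```

Because there is only one replacement, the order problem cannot occur, and the special cases are handled on exact values rather than on string prefixes.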