I have a column bike_name and I want to know the easiest way to split it into year and CC.
CC should contain the numeric data attached before the word cc. In some cases, where cc is not available, it should remain blank.
While year contains just the year in the last word.
TVS Star City Plus Dual Tone 110cc 2018
Royal Enfield Classic 350cc 2017
Triumph Daytona 675R 2013
TVS Apache RTR 180cc 2017
Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018
Yamaha FZs 150cc 2015
You can extract them separately: year
is the last 4 characters, CC
is via a regex:
df["year"] = df.bike_name.str[-4:]
df["CC"] = df.bike_name.str.extract(r"(\d+)cc").fillna("")
where regex is looking for sequence of digits followed literally by "cc"
and in case of no match, it will give NaN
s; so we fill them with empty string,
to get
bike_name year CC
0 TVS Star City Plus Dual Tone 110cc 2018 2018 110
1 Royal Enfield Classic 350cc 2017 2017 350
2 Triumph Daytona 675R 2013 2013
3 TVS Apache RTR 180cc 2017 2017 180
4 Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018 2018 150
5 Yamaha FZs 150cc 2015 2015 150
df.bike_name = (df.bike_name.str[:-4]
.str.replace(r"\d+cc", "", regex=True)
.str.rstrip())
where first line removes year, second line removes the cc parts and lastly we r
ight strip
all the rows if space at the end is unwanted,
to get
>>> df
bike_name year CC
0 TVS Star City Plus Dual 2018 110
1 Royal Enfield Cla 2017 350
2 Triumph Daytona 2013
3 TVS Apache 2017 180
4 Yamaha FZ S V 2.0 -Ltd. Edi 2018 150
5 Yamaha 2015 150