Search code examples
python-3.xpandasdataframenlp

How to extract and create new columns from specific match


I have a column bike_name and I want to know the easiest way to split it into year and CC.

CC should contain the numeric data attached before the word cc. In some cases, where cc is not available, it should remain blank.

While year contains just the year in the last word.

TVS Star City Plus Dual Tone 110cc 2018
Royal Enfield Classic 350cc 2017
Triumph Daytona 675R 2013
TVS Apache RTR 180cc 2017
Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018
Yamaha FZs 150cc 2015

Solution

  • You can extract them separately: year is the last 4 characters, CC is via a regex:

    df["year"] = df.bike_name.str[-4:]
    
    df["CC"] = df.bike_name.str.extract(r"(\d+)cc").fillna("")
    

    where regex is looking for sequence of digits followed literally by "cc" and in case of no match, it will give NaNs; so we fill them with empty string,

    to get

                                       bike_name  year   CC
    0    TVS Star City Plus Dual Tone 110cc 2018  2018  110
    1           Royal Enfield Classic 350cc 2017  2017  350
    2                  Triumph Daytona 675R 2013  2013
    3                  TVS Apache RTR 180cc 2017  2017  180
    4  Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018  2018  150
    5                      Yamaha FZs 150cc 2015  2015  150
    

    If not only extraction but also removal is needed:
    df.bike_name = (df.bike_name.str[:-4]
                      .str.replace(r"\d+cc", "", regex=True)
                      .str.rstrip())
    

    where first line removes year, second line removes the cc parts and lastly we right strip all the rows if space at the end is unwanted,

    to get

    >>> df
    
                         bike_name  year   CC
    0      TVS Star City Plus Dual  2018  110
    1            Royal Enfield Cla  2017  350
    2              Triumph Daytona  2013
    3                   TVS Apache  2017  180
    4  Yamaha FZ S V 2.0 -Ltd. Edi  2018  150
    5                       Yamaha  2015  150