python, regex, pandas, split, row

Split a row into more rows based on a string (regex)


I have this df and I want to split it:

import pandas as pd

cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
           'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)

cities4

to get a new df like this one, where each NHL team has its own row:

[image: goal df]

What code can I use?


Solution

  • You can split your column at every position where a lower-case letter is immediately followed by an upper-case one, using this zero-width regex:

    (?<=[a-z])(?=[A-Z])
    

    and then you can use the technique described in this answer to replace the column with its exploded version:

    cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
    

    Output:

        Metropolitan        NHL
    0       New York    Rangers
    0       New York  Islanders
    0       New York     Devils
    1    Los Angeles      Kings
    1    Los Angeles      Ducks
    2  San Francisco     Sharks
    

    If you want to reset the index (to 0..5), you can do this, either after the above command or chained onto it:

    cities4.reset_index(drop=True)
    

    Output:

        Metropolitan        NHL
    0       New York    Rangers
    1       New York  Islanders
    2       New York     Devils
    3    Los Angeles      Kings
    4    Los Angeles      Ducks
    5  San Francisco     Sharks
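
For reference, here is a self-contained sketch of the steps above. It first tries the regex on a plain string with re.split, then runs the split-and-explode on the question's data. It assumes Python 3.7+ (where re.split splits on zero-width matches) and pandas 1.1+ (for the ignore_index argument of explode, which renumbers the rows in the same step). The names BOUNDARY, parts and result are only illustrative and do not appear in the original post.

```python
import re

import pandas as pd

# Zero-width split point: right after a lower-case letter and
# right before an upper-case letter. No characters are consumed.
BOUNDARY = r'(?<=[a-z])(?=[A-Z])'

# Try the pattern on a plain string first, outside pandas.
parts = re.split(BOUNDARY, 'RangersIslandersDevils')
print(parts)  # ['Rangers', 'Islanders', 'Devils']

# Same data as in the question.
cities4 = pd.DataFrame({
    'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
    'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks'],
})

# Split each NHL string into a list, then give every list element its own row.
# ignore_index=True (assumed available, pandas >= 1.1) relabels the rows 0..5,
# so no separate reset_index call is needed.
result = (
    cities4
    .assign(NHL=cities4['NHL'].str.split(BOUNDARY))
    .explode('NHL', ignore_index=True)
)
print(result)
```

Note that the pattern only splits where a lower-case letter meets an upper-case one, so a team name that itself contains such a boundary, or one written with a space before the capital, would need a different rule.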