Search code examples
pythonpandasdataframesplitdelimiter

pandas split by last delimiter


I have the following column in a dataframe with different outputs:

col1
MLB|NBA|NFL
MLB|NBA
NFL|NHL|NBA|MLB

I would like to use the split function to split the column by the last pipe always so something like this:

col1           col2
MLB|NBA        NFL
MLB            NBA
NFL|NHL|NBA    MLB

Solution

  • With Series.str.rsplit, limiting the number of splits.

    df['col1'].str.rsplit('|', n=1, expand=True).rename(columns=lambda x: f'col{x + 1}')
    

    If the above throws you a SyntaxError, it means you're on a python version older than 3.6 (shame on you!). Use instead

    df.col1.str.rsplit('|', 1, expand=True)\
      .rename(columns=lambda x: 'col{}'.format(x + 1))
    
              col1 col2
    0      MLB|NBA  NFL
    1          MLB  NBA
    2  NFL|NHL|NBA  MLB
    

    There's also the faster loopy str.rsplit equivalent.

    pd.DataFrame(
        [x.rsplit('|', 1) for x in df['col1'].tolist()], 
        columns=['col1', 'col2']
    ) 
              col1 col2
    0      MLB|NBA  NFL
    1          MLB  NBA
    2  NFL|NHL|NBA  MLB
    

    P.S., yes, the second solution is faster:

    df = pd.concat([df] * 100000, ignore_index=True)
    
    %timeit df['col1'].str.rsplit('|', n=1, expand=True)
    %timeit pd.DataFrame([x.rsplit('|', 1) for x in df['col1'].tolist()])
    
    473 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    128 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)