Search code examples
pythonpandassplitdelimiter

Splitting pandas column with two delimiters creates extra empty columns


I load in data from a CSV, one of the columns has this format:

!Color1:Color2:Color3!

!White:Green:Black!
!Green:Blue:Yellow!!Red:Brown:Blue!!White:Green:Black!
!Green:Blue:Yellow!!White:Green:Black!
!Red:Brown:Blue!!White:Green:Black!

I want to discard all of the other columns, pick this one out, then split this one into this:

     0                      1                     2
0    White:Green:Black      None                  None
1    Green:Blue:Yellow      Red:Brown:Blue        White:Green:Black
2    Green:Blue:Yellow      White:Green:Black     None
3    Red:Brown:Blue         White:Green:Black     None

Below is how I tried to do it:

df = pd.read_csv(csv_path, index_col=False)
new_df = df['!Color1:Color2:Color3!'].str.split('!', expand=True)

But it ends up like this:

     0        1                      2          3                     4         5
0    None     White:Green:Black      None       None                  None      None
1    None     Green:Blue:Yellow      None       Red:Brown:Blue        None      White:Green:Black
2    None     Green:Blue:Yellow      None       White:Green:Black     None      None
3    None     Red:Brown:Blue         None       White:Green:Black     None      None

So it interprets the first "!" as a field of its own, and so it adds empty fields between the "parts".

Bonus question:
After that is achieved, how do I pick out the middle color in each column, like this?:

     0           1           2
0    Green       None        None
1    Blue        Brown       Green
2    Blue        Green       None
3    Brown       Green       None

Solution

  • Add Series.str.strip for avoid last and first columns filled by empty strings and regex !{1,} for split 1 or multiple !:

    new_df = df['!Color1:Color2:Color3!'].str.strip('!').str.split('!{1,}', expand=True)
    print (new_df)
                       0                  1                  2
    0  White:Green:Black               None               None
    1  Green:Blue:Yellow     Red:Brown:Blue  White:Green:Black
    2  Green:Blue:Yellow  White:Green:Black               None
    3     Red:Brown:Blue  White:Green:Black               None
    

    Also if need second splitted values by : use custom lambda function:

    new_df = (df['!Color1:Color2:Color3!']
                   .str.strip('!')
                   .str.split('!{1,}', expand=True)
                   .apply(lambda x: x.str.split(':').str[1]))
    print (new_df)
           0      1      2
    0  Green   None   None
    1   Blue  Brown  Green
    2   Blue  Green   None
    3  Brown  Green   None