Search code examples
pythonpandasdataframesplitexplode

Split/explode cells into multiple rows based on conditions in pandas dataframe


The code for input dataframe is

import pandas as pd
df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (!AAA))','Column2': 'XYZ + XXX/YYY'}])

Input Dataframe:-

+---------------------------------+---------------------------------+
|              Column1            |              Column2            +
+---------------------------------+---------------------------------+
| ((CC ) + (A11/ABC/ZZ) + (!AAA)) |           XYZ + XXX/YYY         |
+---------------------------------+---------------------------------+

Input list:-

list = [AAA,BBB,CCC]

Conditions:-

'+' should remain as such (similar to AND condition)
'/' means split the data into multiple cells (similar to OR condition)
'!' means replace with other elements in the corresponding list (similar to NOT condition)

Because of the ! sign, the row becomes

+------------------------------------+---------------------------------+
|              Column1               |              Column2            +
+------------------------------------+---------------------------------+
| ((CC ) + (A11/ABC/ZZ) + (BBB/CCC)) |           XYZ + XXX/YYY         |
+------------------------------------+---------------------------------+

Please help me to split the singe row into multiple rows as shown below using Pandas

+---------------------------------+---------------------------------+
|              Column1            |              Column2            +
+---------------------------------+---------------------------------+
|          CC + A11 + BBB         |             XYZ + XXX           |
+---------------------------------+---------------------------------+
|          CC + ABC + BBB         |             XYZ + XXX           |
+---------------------------------+---------------------------------+
|          CC + ZZ + BBB          |             XYZ + XXX           |
+---------------------------------+---------------------------------+
|          CC + A11 + CCC         |             XYZ + XXX           |
+---------------------------------+---------------------------------+
|          CC + ABC + CCC         |             XYZ + XXX           |
+---------------------------------+---------------------------------+
|          CC + ZZ + CCC          |             XYZ + XXX           |
+---------------------------------+---------------------------------+
|          CC + A11 + BBB         |             XYZ + YYY           |
+---------------------------------+---------------------------------+
|          CC + ABC + BBB         |             XYZ + YYY           |
+---------------------------------+---------------------------------+
|          CC + ZZ + BBB          |             XYZ + YYY           |
+---------------------------------+---------------------------------+
|          CC + A11 + CCC         |             XYZ + YYY           |
+---------------------------------+---------------------------------+
|          CC + ABC + CCC         |             XYZ + YYY           |
+---------------------------------+---------------------------------+
|          CC + ZZ + CCC          |             XYZ + YYY           |
+---------------------------------+---------------------------------+

Solution

  • See whether this meets your requirements. The comments explain how it works.

    #!/usr/bin/env python
    import pandas as pd # tested with pd.__version__ 0.19.2
    df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (!AAA))',
                        'Column2': 'XYZ + XXX/YYY'}])   # your input dataframe
    list = ['AAA', 'BBB', 'CCC']                        # your input list
    to_replace = dict()
    for item in list:   # prepare the dictionary for the '!' replacements
        to_replace["!"+item+'\\b'] = '/'.join([i for i in list if i != item])
    df = df.replace(to_replace, regex=True) # do all the '!' replacements
    import re
    def expanded(s):    # expand series s to multiple string list around '/'
        l = s.str.replace('[()]', '').tolist()
        while True:     # in each loop cycle, handle one A/B/C... expression
            xl = []     # expanded list for this cycle
            for s in l: # for each string in the list so far
                m = re.search(r'\w+(/\w+)+', s) # look for a A/B/C... expression
                if m:   # if there is, add the individual expansions to the list
                    xl.extend([m.string[:m.start()]+i+m.string[m.end():]
                                                for i in m.group().split('/')])
                else:   # if not, we're done
                    return l
            l = xl      # expanded list for this cycle is now the current list
    def expand(c):      # expands the column named c to multiple rows
        new = expanded(df[c])                       # get the new contents
        xdf = pd.concat(len(new)/len(df[c])*[df])   # create required rows
        xdf[c] = sorted(new)                        # set the new contents
        return xdf                                  # return new dataframe
    df = expand('Column1')
    df = expand('Column2')
    print df
    

    Output:

               Column1    Column2
    0  CC  + A11 + BBB  XYZ + XXX
    0  CC  + A11 + CCC  XYZ + XXX
    0  CC  + ABC + BBB  XYZ + XXX
    0  CC  + ABC + CCC  XYZ + XXX
    0   CC  + ZZ + BBB  XYZ + XXX
    0   CC  + ZZ + CCC  XYZ + XXX
    0  CC  + A11 + BBB  XYZ + YYY
    0  CC  + A11 + CCC  XYZ + YYY
    0  CC  + ABC + BBB  XYZ + YYY
    0  CC  + ABC + CCC  XYZ + YYY
    0   CC  + ZZ + BBB  XYZ + YYY
    0   CC  + ZZ + CCC  XYZ + YYY