Search code examples
pythonpandasdataframedata-sciencedata-analysis

have two dictionary columns with different lengths want to use series explode but it mismatch


I have tow columns in my data frame that have multiple dictionarise in it and I want to expand it into multiple columns but the problem when I use explode series it mismatch

example:

Column A Column B Columns C
Cell 1 {"a":0.5} {"b":0.5} {"d1":0.25}{"d2":0.25}{"d3":0.25}{"d4":0.25}
Cell 2 {"c":1.0} {"t1":0.5} {"t2":0.5}

desirable output:

Column A Column B1 Columns B2 Columns C1 Columns C2
Cell 1 "a" 0.5 "d1" 0.25
Cell 1 "a" 0.5 "d2" 0.25
Cell 1 "b" 0.5 "d3" 0.25
Cell 1 "b" 0.5 "d4" 0.25
Cell 2 "c" 1.0 "t1" 0.5
Cell 2 "c" 1.0 "t2" 0.5

i used to use

df1 = ( 
       df.assign(Columns B2 =lambda df: df.Columns B.apply(lambda x: x.values()))    
     
       .reset_index()
       .apply(pd.Series.explode)
     
            )
output = ( 
       df1.assign(Columns C2 =lambda df: df.Columns C.apply(lambda x: x.values()))    
     
       .reset_index()
       .apply(pd.Series.explode)
     
            )

the problem is that it mismatches the rows something in column C1 supposed to be in row 2 go to 3


Solution

  • Assuming this input and that Column B has the shortest dictionaries:

    df = pd.DataFrame({'Column A': ['Cell 1', 'Cell 2'],
                       'Column B': [[{'a': 0.5}, {'b': 0.5}], [{'c': 1.0}]],
                       'Column C': [[{'d1': 0.25}, {'d2': 0.25}, {'d3': 0.25}, {'d4': 0.25}], [{'t1': 0.5}, {'t2': 0.5}]]})
    

    You can try to repeat the shortest list, then convert your key/value pairs as new DataFrames:

    tmp = pd.DataFrame([[np.repeat(a, int(np.ceil(len(b)/len(a))))[:len(b)], b]
                        for a, b in zip(df['Column B'], df['Column C'])],
                       columns=['Column B', 'Column C']
                       ).explode(['Column B', 'Column C'])
    
    def convert(col):
        s = col.apply(lambda x: list(x.items())[0])
        return (pd.DataFrame(s.tolist(), index=s.index)
                  .rename(columns=lambda x: f'{col.name}{x+1}')
                )
    
    out = (df.drop(columns=['Column B', 'Column C'])
             .join(pd.concat([convert(tmp[c]) for c in list(tmp)], axis=1))
          )
    

    Output:

      Column A Column B1  Column B2 Column C1  Column C2
    0   Cell 1         a        0.5        d1       0.25
    0   Cell 1         a        0.5        d2       0.25
    0   Cell 1         b        0.5        d3       0.25
    0   Cell 1         b        0.5        d4       0.25
    1   Cell 2         c        1.0        t1       0.50
    1   Cell 2         c        1.0        t2       0.50