python list pandas python-itertools

multiple nested lists with permutations to pandas


This is my first serious question about Python.

I have a few nested lists that I need to convert to a pandas DataFrame. It seems easy, but a few things make it challenging for me:

  • the lists are huge (so the code needs to be fast)
  • they are nested
  • when they are nested, I need combinations.

So having this input:

la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

I need the following output:

la      lb      lc
a       1       1
b       2       2
b       2       22
c       3       3
c       33      3
d       11      11
d       11      12
d       11      13
d       12      11
d       12      12
d       12      13
d       13      11
d       13      12
d       13      13
e       4       4

Note that I need all combinations (the Cartesian product of the sublists) whenever I have a nested list. At first I tried simply:

import pandas as pd
pd.DataFrame({'la' : [x for x in la],
              'lb' : [x for x in lb],
              'lc' : [x for x in lc]})

But looking for rows that need expanding and actually expanding (a huge) DataFrame seemed harder than tinkering around the way I create the DataFrame.

I looked at some great posts about itertools (Flattening a shallow list in Python), the documentation (https://docs.python.org/3.6/library/itertools.html) and generators (What does the "yield" keyword do?), and came up with something like this:

import itertools

def f(la, lb, lc):
    tmp = len(la) == len(lb) == len(lc)
    if tmp:
        for item in range(len(la)):
            len_b = len(lb[item])
            len_c = len(lc[item])
            if ((len_b>1) or (len_c>1)):
                yield list(itertools.product(la[item], lb[item], lc[item]))
                ## above: list is not the result I need,
                ##        without it it breaks (not an iterable)
            else:
                yield (la[item], lb[item], lc[item])
    else:
        print('error: unequal length')

which I test with:

my_gen = f(la, lb, lc)
pd.DataFrame.from_records(my_gen)

which... well... breaks when I yield the itertools.product object directly (it has no length), and creates the wrong data structure once I wrap it in a list.

My questions are as follows:

  • how can I fix that issue with yielding itertools?
  • is this efficient? In the real application I will be creating the lists by parsing a file, and they will be huge... Any performance tips or better solutions from more advanced colleagues? Right now it breaks/misbehaves, so I can't even benchmark...
  • would it make sense to generate the lists element by element and then use my f function?

Thank you in advance!


Solution

  • I have a solution:

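    import pandas as pd
    from functools import reduce  # reduce is not a builtin in Python 3
    from itertools import product
    
    la =  ['a', 'b', 'c', 'd', 'e']
    lb = [[1], [2], [3, 33], [11,12,13], [4]]
    lc = [[1], [2, 22], [3], [11,12,13], [4]]
    
    # zip pairs each label with its two sublists; product expands every combination.
    # This relies on each item of la being a one-character string, because
    # product iterates over the string itself.
    # reduce then concatenates the per-row lists into one flat list of tuples.
    list_product = reduce(lambda x, y: x + y, [list(product(*_)) for _ in zip(la,lb,lc)])
    df = pd.DataFrame(list_product, columns=["la", "lb", "lc"])
    print(df)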
    

    result:

        la  lb  lc
    0   a   1   1
    1   b   2   2
    2   b   2   22
    3   c   3   3
    4   c   33  3
    5   d   11  11
    6   d   11  12
    7   d   11  13
    8   d   12  11
    9   d   12  12
    10  d   12  13
    11  d   13  11
    12  d   13  12
    13  d   13  13
    14  e   4   4
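
On the first question (yielding from itertools), a generator can hand out the tuples of each product one at a time with `yield from`, instead of yielding the product object itself. The sketch below is one possible variant, not a benchmarked replacement. The function name `expand_rows` is made up for illustration, and the label is wrapped in a list on the assumption that real labels may be longer than one character.

```python
from itertools import product

def expand_rows(labels, *nested_columns):
    """Yield one tuple per combination of the sublists in each row."""
    lengths = {len(labels), *(len(col) for col in nested_columns)}
    if len(lengths) != 1:
        raise ValueError("all input lists must have the same length")
    for label, *sublists in zip(labels, *nested_columns):
        # Wrap the label in a list so that a multi-character string is
        # treated as one value rather than split into characters.
        yield from product([label], *sublists)

la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]

rows = list(expand_rows(la, lb, lc))
print(len(rows))   # 15
print(rows[:3])    # [('a', 1, 1), ('b', 2, 2), ('b', 2, 22)]
```

The resulting list can be passed to `pd.DataFrame(rows, columns=["la", "lb", "lc"])` as in the solution above. Because the rows are produced lazily and collected once, this avoids the repeated list concatenation that `reduce` performs, which may matter for very large inputs.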