Tags: python, pandas, list, dataframe, ragged

Ragged list to dataframe


I have a non-uniform list as follows:

[['E', 'A', 'P'],
 ['E', 'A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['P'],
 ['E', 'A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['E', 'A', 'P'],
 ['E', 'A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['E', 'A', 'P'],
 ['A', 'X', 'P']]

I would like to create a DataFrame from this in which each of the four possible letters ("E", "A", "X" and "P") gets its own column, one-hot encoded: 1 if the letter appears in that row's list and 0 otherwise. What is the most efficient way to do this?


Solution

  • Try:

    import pandas as pd

    lst = [
        ["E", "A", "P"],
        ["E", "A", "X", "P"],
        ["E", "A", "P"],
        ["P"],
        ["E", "A", "X", "P"],
        ["E", "A", "P"],
        ["A", "X", "P"],
        ["E", "A", "P"],
        ["E", "A", "P"],
        ["E", "A", "X", "P"],
        ["E", "A", "P"],
        ["E", "A", "P"],
        ["A", "X", "P"],
    ]
    
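    # Build one dict per row whose keys are the letters present; pandas aligns
    # the keys into columns and fills the letters a row lacks with NaN.
    # notna() marks presence, and astype(int) converts True/False to 1/0.
    df = pd.DataFrame({v: 1 for v in row} for row in lst).notna().astype(int)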
    print(df)
    

    Prints:

        E  A  P  X
    0   1  1  1  0
    1   1  1  1  1
    2   1  1  1  0
    3   0  0  1  0
    4   1  1  1  1
    5   1  1  1  0
    6   0  1  1  1
    7   1  1  1  0
    8   1  1  1  0
    9   1  1  1  1
    10  1  1  1  0
    11  1  1  1  0
    12  0  1  1  1
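
Note that the columns above come out in first-seen order (E, A, P, X), not the E, A, X, P order listed in the question. If a fixed order matters, one possible variation is to reindex the columns before converting. This is a minimal sketch on a shortened list; the names `columns` and `onehot` are made up for illustration:

```python
import pandas as pd

# Shortened sample of the ragged list from the question.
lst = [["E", "A", "P"], ["P"], ["A", "X", "P"]]

# Assumed column order, taken from the letters named in the question.
columns = ["E", "A", "X", "P"]

# One dict per row; letters missing from a row become NaN.
df = pd.DataFrame({v: 1 for v in row} for row in lst)

# reindex fixes the column order (a letter that never occurs would become
# an all-NaN column); notna().astype(int) then yields 1 for present, 0 for absent.
onehot = df.reindex(columns=columns).notna().astype(int)
print(onehot)
```

Reindexing also guarantees that all four columns exist even if some letter never occurs in the input, which the plain version does not.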