Search code examples
pythonpandasmatrixintersection

Python, how to fill an empty dataframe with lists


I'm trying to write a code to save in a matrix the common elements between some lists. Example:

Data frame with all the lists:

ID elements of the ID
G1 P1,P2,P3,P4
G2 P3,P5
G3 P1,P3,P5
G4 P6

I start with an empty matrix having G1,G2,G3,G4 as columns and rows names and the cells filled with nan, the result I would like to obtain is the following:

X G1 G2 G3 G4
G1 P1,P2,P3,P4 P3 P1 None
G2 P3 P3,P5 P3,P5 None
G3 P1,P5 P3,P5 P1,P3,P5 None
G4 None None None P6

This is my code:

import sys
import pandas as pd

def intersection(lst1, lst2):
    return [value for value in lst1 if value in lst2]

data = pd.read_csv(sys.argv[1], sep="\t")
p_mat = pd.read_csv(sys.argv[2], sep="\t", index_col=0)
c_mat = pd.read_csv(sys.argv[3], sep="\t", index_col=0)

#I need this since the elements of the second column once imported are seen as a single string instead of being lists
for i in range(0,len(data)):
    data['MP term list'][i] = data['MP term list'][i].split(",")


for i in p_mat:
    for j in p_mat.columns:
        r = intersection(data[data['MGI id'] == i]['MP term list'].values.tolist()[0],data[data['MGI id'] == j]['MP term list'].values.tolist()[0])
        if len(r)!=0:
            p_mat.at[i,j] = r
        else:
            p_mat.at[i, j] = None
        del(r) 

For now I'm able to fill only the first cell correctly, then at the first non-empty result that I try to store in a cell I get this error:

ValueError: Must have equal len keys and value when setting with an iterable

How can I fix it?


Solution

  • Try with a cross merge, set intersection and pivot:

    df["elements"] = df["elements of the ID"].str.split(",").map(set)
    
    cross = df[["ID", "elements"]].merge(df[["ID", "elements"]], how="cross")
    cross["intersection"] = (cross.apply(lambda row: row["elements_x"].intersection(row["elements_y"]), axis=1)
                                  .map(",".join)
                                  .replace("",None)
                            )
    
    output = cross.pivot("ID_x", "ID_y", "intersection").rename_axis(None, axis=1).rename_axis(None)
    
    >>> output
                 G1     G2        G3    G4
    G1  P2,P1,P3,P4     P3     P1,P3  None
    G2           P3  P3,P5     P3,P5  None
    G3        P1,P3  P3,P5  P1,P3,P5  None
    G4         None   None      None    P6
    
    Input df:
    df = pd.DataFrame({"ID": [f"G{i+1}" for i in range(4)],
                       "elements of the ID": ["P1,P2,P3,P4", "P3,P5", "P1,P3,P5", "P6"]})