Search code examples
pythonpandasdataframegraphadjacency-list

Create an adjacency list from a pandas Dataframe containing nodes


I have a pandas DataFrame containing rows of nodes that I ultimately would like to connect and turn into a graph like object. For this, I first thought of converting this DataFrame to something that resembles an adjacency list, to later on easily create a graph from this. I have the following:

A pandas Dataframe:

df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5, 6],
                   "start": ["A", "B", "D", "A", "X", "F", "B"],
                   "end": ["B", "C", "F", "G", "X", "X", "E"],
                   "cases": [["c1", "c2", "c44"], ["c2", "c1", "c3"], ["c4"], ["c1", ], ["c1", "c7"], ["c4"], ["c44", "c7"]]})

which looks like this:

    id  start   end     cases            
0   0   A       B       [c1, c2, c44]    
1   1   B       C       [c2, c1, c3]     
2   2   D       F       [c4]             
3   3   A       G       [c1]             
4   4   X       X       [c1, c7]         
5   5   F       X       [c4]             
6   6   B       E       [c44, c7]        

A function directly_follows(i, j) that returns true if the node in row i is followed by the node in row j (this wil later be a directed edge in a graph from node i to node j):

def directly_follows(row1, row2):
    return close(row1, row2) and case_overlap(row1, row2)

def close(row1, row2):
    return row1["end"] == row2["start"]

def case_overlap(row1, row2):
    return not set(row1["cases"]).isdisjoint(row2["cases"])

Shortly, node i is followed by node j if the end value of node i is the same as the start value of node j and if their cases overlap

Based on this directly_follows function, I want to create an extra column to my DataFrame df which acts as an adjacency list, containing for node i a list with the id values of nodes that follow i

My desired result would thus be:

    id  start   end     cases            adjacency_list
0   0   A       B       [c1, c2, c44]    [1, 6]
1   1   B       C       [c2, c1, c3]     []
2   2   D       F       [c4]             [5]
3   3   A       G       [c1]             []
4   4   X       X       [c1, c7]         []
5   5   F       X       [c4]             []
6   6   B       E       [c44, c7]        []

Basically I thought of first creating the column adjacency_list as empty lists, and then looping through the rows of the Dataframe and if for row i and j directly_follows(row_i, row_j) returns True, add the id of j to the adjacency list of i.

I did it like this:

def connect(data):
    data["adjacency_list"] = np.empty((len(data), 0)).tolist()
    
    for i in range(len(data)):
        for j in range(len(data)):
            if i != j:
                if directly_follows(data.iloc[i], data.iloc[j]):
                    data.iloc[i]["adjacency_list"] = data.iloc[i]["adjacency_list"].append(data.iloc[i]["id"])

Now first, this returns an error

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

And secondly, I highly doubt this is the most pythonic and efficient way to solve this problem, since my actual DataFrame consists of about 9000 rows, which would give around 81 million comparisons.

How to create the adjacency list in the least time consuming way? Is there maybe a faster or more elegant solution than mine?


Solution

  • One option would be to apply the following function - it's not completely vectorised because Dataframes don't particularly like embedding mutable objects like lists, and I don't think you can apply set operations in a vectorised way. It does cut down the number of comparisons needed though.

    def f(x):
        check = df[(x["end"] == df["start"])]
        return [
            row["id"]
            for i, row in check.iterrows()
            if not set(row["cases"]).isdisjoint(x["cases"])
        ]
    
    
    df["adjacency_list"] = df.apply(f, axis=1)
    

    Or, as a big lambda function:

    df["adjacency_list"] = df.apply(
        lambda x: [
            row["id"]
            for i, row in df[(x["end"] == df["start"])].iterrows()
            if not set(row["cases"]).isdisjoint(x["cases"])
        ],
        axis=1,
    )
    

    Output

       id start end          cases adjacency_list
    0   0     A   B  [c1, c2, c44]         [1, 6]
    1   1     B   C   [c2, c1, c3]             []
    2   2     D   F           [c4]            [5]
    3   3     A   G           [c1]             []
    4   4     X   X       [c1, c7]            [4]
    5   5     F   X           [c4]             []
    6   6     B   E      [c44, c7]             []