Search code examples
pythondataframerowspairwise

Pairwise row match of tables in Python


I am trying to do a pairwise comparison for 100+ tables to find the rows in common by "ID". Is there a simple way to do such a pairwise comparison without going through a lot of loops?

The expected output is a matrix/heatmap showing the similarity between tables, with number of rows in common between the tables.

A sample of the tables I have is attached. They are all saved in the same directory at my machine.

Thank you,

Table 1 Table 2


Solution

  • If I understand correctly, you want the pairwise similarity of the ID columns only, regardless of whether the Value column has different values for that ID. Assuming your tables are represented as a list of dataframes called 'dfs':

    m = np.zeros(shape=(len(dataframes), len(dataframes)))
    for i, d in enumerate(dataframes):
        for j, d_ in enumerate(dataframes):
            commons = len(set(d.ID).intersection(set(d_.ID)))
            m[i][j] = commons
    df = pd.DataFrame(m).applymap(int)
    print(df)