I am trying to do a pairwise comparison for 100+ tables to find the rows in common by "ID". Is there a simple way to do such a pairwise comparison without going through a lot of loops?
The expected output is a matrix/heatmap showing the similarity between tables, with number of rows in common between the tables.
A sample of the tables I have is attached. They are all saved in the same directory at my machine.
Thank you,
If I understand correctly, you want the pairwise similarity of the ID columns only, regardless of whether the Value column has different values for that ID. Assuming your tables are represented as a list of dataframes called 'dfs':
m = np.zeros(shape=(len(dataframes), len(dataframes)))
for i, d in enumerate(dataframes):
for j, d_ in enumerate(dataframes):
commons = len(set(d.ID).intersection(set(d_.ID)))
m[i][j] = commons
df = pd.DataFrame(m).applymap(int)
print(df)