
Creating a co-occurrence heatmap for a dictionary of lists


I'm having trouble figuring out which packages and logical flow work best for this problem.

I have a dictionary like so (the lists of values have been shortened for readability):

dict = {'term_1': ['30939593',
  '30938516',
  '30930058',
  '30928978',
  '30927713',
  '30927284',
  '30925500',
  '30923740',
  '30922102',
   ...],
'term_2': ['30931235',
  '30938516',
  '30928978',
  '30922102',
  '30858642',
  '30828702',
  '30815562',
  '30805732',
  '30766735',
  '30746412',
  '30740089',
   ...],
   etc. 
}

Between the two terms I've listed, three values co-occur (30938516, 30928978, and 30922102).

The dictionary contains about 1800 keys, each with a list of corresponding IDs, and some of these lists may be 100,000 values long.

I want to visualize, in a heatmap, the degree of similarity between every pair of terms in the dictionary, based on the co-occurrence of IDs in their lists of values. The x and y axes of the heatmap would be labelled with the same terms in sequential order, and each cell would show the number of IDs shared by one term and another (in this case, the co-occurrence between term_1 and term_2 would be 3). This would be repeated for all 1800 terms, leading to an 1800x1800 heatmap.


Treating the values as strings, I've tried converting the dictionary into two dataframes: one where the terms are the column headers and the values are listed by column, and the other where the terms are the row headers and the values are listed by row.

First, I converted the dictionary into a dataframe:

import pandas as pd

# One row per term; shorter lists are padded with missing values
df = pd.DataFrame.from_dict(dict, orient='index')
# Join each row's non-missing IDs into one comma-separated string
df = df.apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)

However, this only converts the dictionary into a single column of length 1800. I would also need to find a way to expand the dataframe so that each column is duplicated 1800 times.
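To make the single-column result concrete, here is a small sketch of the same two steps on a shortened dictionary. The name `data` and its values are made up for illustration; only the pandas calls mirror the snippet above.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the real dictionary.
data = {"term_1": ["a", "b", "c"], "term_2": ["b", "d"]}

# orient="index" gives one row per term and one column per list position;
# shorter lists are padded with missing values.
wide = pd.DataFrame.from_dict(data, orient="index")
print(wide.shape)

# apply(..., axis=1) returns one joined string per row, i.e. a Series
# with one entry per term rather than a term-by-term grid.
joined = wide.apply(lambda row: ",".join(row.dropna().astype(str)), axis=1)
print(type(joined).__name__, len(joined))
```

The result has one entry per term, which is why the real data ends up as a single column of length 1800.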

Once I have this 1800 x 1800 dataframe, I would transpose it:

df_transposed = df.T

If we treat each pair of cells being compared as two lists, we can approach each comparison like so:

l1 = ['30939593',
  '30938516',
  '30930058',
  '30928978',
  '30927713',
  '30927284',
  '30925500',
  '30923740',
  '30922102']
l2 = ['30931235',
  '30938516',
  '30928978',
  '30922102',
  '30858642',
  '30828702',
  '30815562',
  '30805732',
  '30766735',
  '30746412',
  '30740089']
from collections import Counter
c = len(list((Counter(l1) & Counter(l2)).elements()))

This gives c = 3.
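One point worth checking: the Counter intersection counts repeated IDs with multiplicity, while a set intersection counts each distinct ID once. The question does not say whether IDs can repeat within a list, so the duplicated lists below are purely hypothetical; the sketch only shows where the two counts agree and where they differ.

```python
from collections import Counter

l1 = ["30939593", "30938516", "30930058", "30928978", "30922102"]
l2 = ["30931235", "30938516", "30928978", "30922102", "30858642"]


def multiset_overlap(a, b):
    # Repeated IDs are counted up to their minimum multiplicity.
    return sum((Counter(a) & Counter(b)).values())


def set_overlap(a, b):
    # Each distinct shared ID is counted exactly once.
    return len(set(a) & set(b))


# With unique IDs in each list, both approaches agree.
print(multiset_overlap(l1, l2), set_overlap(l1, l2))

# Hypothetical case: the same ID appears twice in both lists.
l1_dup = l1 + ["30938516"]
l2_dup = l2 + ["30938516"]
print(multiset_overlap(l1_dup, l2_dup), set_overlap(l1_dup, l2_dup))
```

If every list holds unique IDs, the set version is the simpler choice and gives the same counts.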

However, I am unsure how to loop through this within the confines of a dataframe.

I want to compare every pair of terms in the 1800x1800 grid so that each cell contains an integer count of the co-occurring IDs between the two terms. I would then convert this 1800x1800 grid of integers to a heatmap.


Solution

  • One way is to calculate the overlaps first based on the dictionary d and then make the required DataFrame with pivot:

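    import pandas as pd

    # d is the dictionary from the question; one (term, term, overlap) triple per ordered pair
    x = [(k1, k2, len(set(d1) & set(d2))) for k1, d1 in d.items() for k2, d2 in d.items()]
    df = pd.DataFrame(x).pivot(index=0, columns=1, values=2)

    print(df)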
    

    Output:

    1       term_1  term_2
    0                     
    term_1       9       3
    term_2       3      11
    

    And, of course, for the heatmap:

    import seaborn as sns
    sns.heatmap(df)
    

    Output:

    [Image: heatmap of the overlap counts]
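With about 1800 terms and lists of up to 100,000 IDs, rebuilding both sets for every ordered pair may become slow. A possible variation, sketched here on made-up data, converts each list to a set once and uses the symmetry of the overlap matrix to compute each pair only once. The names `terms`, `sets`, `counts`, and `overlap_df`, and the `term_3` entry, are assumptions for illustration and not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy stand-in for the real dictionary (hypothetical, shortened data).
d = {
    "term_1": ["30939593", "30938516", "30928978", "30922102"],
    "term_2": ["30931235", "30938516", "30928978", "30922102", "30858642"],
    "term_3": ["30922102", "11111111"],
}

terms = list(d)
# Build each set exactly once instead of once per pair.
sets = [set(d[t]) for t in terms]

n = len(terms)
counts = np.zeros((n, n), dtype=np.int64)

# Diagonal: a term shares all of its distinct IDs with itself.
for i in range(n):
    counts[i, i] = len(sets[i])

# Off-diagonal: the matrix is symmetric, so compute each pair once and mirror it.
for i, j in combinations(range(n), 2):
    overlap = len(sets[i] & sets[j])
    counts[i, j] = counts[j, i] = overlap

# Rows and columns follow the dictionary's insertion order.
overlap_df = pd.DataFrame(counts, index=terms, columns=terms)
print(overlap_df)
```

The resulting `overlap_df` can be passed to `sns.heatmap` in the same way as `df`. Unlike `pivot`, which sorts the labels, this keeps the terms in the dictionary's own order.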