
How to fix Seaborn clustermap "condensed distance matrix must contain only finite values" error?


I have a three-column CSV file that I am trying to convert to a clustered heatmap. My code looks like this:

sum_mets = pd.read_csv('sum159_localization_met_magma.csv')
df5 = sum_mets[['Phenotype','Gene','P']]

clustermap5 = sns.clustermap(df5, cmap= 'inferno',  figsize=(40, 40), pivot_kws={'index': 'Phenotype', 
                                  'columns' : 'Gene',
                                  'values' : 'P'})

I then receive this ValueError:

ValueError: The condensed distance matrix must contain only finite values.

For context, all of my values are non-zero. I am not sure which values it is unable to process. Thank you in advance to anyone who can help.


Solution

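  • Even if the file itself contains no NaN, you need to check that every Phenotype/Gene combination is present. clustermap pivots the data internally (via pivot_kws), and any combination that is missing from the long table becomes NaN in the pivoted one. For example: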

    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    df = pd.DataFrame({'Phenotype':np.repeat(['very not cool','not cool','very cool','super cool'],4),
                       'Gene':["Gene"+str(i) for i in range(4)]*4,
                       'P':np.random.uniform(0,1,16)})
    
    pd.pivot(df,columns="Gene",values="P",index="Phenotype")
    
    Gene              Gene0     Gene1     Gene2     Gene3
    Phenotype
    not cool       0.567653  0.984555  0.634450  0.406642
    super cool     0.820595  0.072393  0.774895  0.185072
    very cool      0.231772  0.448938  0.951706  0.893692
    very not cool  0.227209  0.684660  0.013394  0.711890
    

    The above pivots without NaN, and plots well:

    sns.clustermap(df,figsize=(5, 5),pivot_kws={'index': 'Phenotype','columns' : 'Gene','values' : 'P'})
    


    But suppose one observation is missing:

    df1 = df[:15]
    pd.pivot(df1,columns="Gene",values="P",index="Phenotype")
    
    Gene              Gene0     Gene1     Gene2     Gene3
    Phenotype
    not cool       0.106681  0.415873  0.480102  0.721195
    super cool     0.961991  0.261710  0.329859       NaN
    very cool      0.069925  0.718771  0.200431  0.196573
    very not cool  0.631423  0.403604  0.043415  0.373299
    

    And it fails if you try to call clustermap:

    sns.clustermap(df1, pivot_kws={'index': 'Phenotype','columns' : 'Gene','values' : 'P'})
    The condensed distance matrix must contain only finite values.
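
To find which cells are responsible, you can pivot the table yourself and list the empty combinations. The sketch below rebuilds the example data with a seeded generator (the seed and the names `wide`, `missing_pairs` and `condensed` are illustrative assumptions, not part of the original code) and also shows, under the assumption that Euclidean distance is used, how a single NaN spreads into the pairwise distances:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Illustrative data: same layout as the example above, seeded for repeatability
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Phenotype": np.repeat(["very not cool", "not cool", "very cool", "super cool"], 4),
    "Gene": ["Gene" + str(i) for i in range(4)] * 4,
    "P": rng.uniform(0, 1, 16),
})
df1 = df[:15]  # drop the last observation

# Same pivot that clustermap performs when given pivot_kws
wide = pd.pivot(df1, index="Phenotype", columns="Gene", values="P")

# List every (Phenotype, Gene) combination with no value
missing_pairs = [
    (row, col)
    for row in wide.index
    for col in wide.columns
    if pd.isna(wide.loc[row, col])
]
print(missing_pairs)

# Euclidean distances between rows: every pair involving the incomplete row is NaN
condensed = pdist(wide.to_numpy())
print(np.isnan(condensed).sum(), "of", condensed.size, "distances are NaN")
```

On real data, the same check on the pivot of `Phenotype`, `Gene` and `P` shows whether the gaps are expected or point to a problem upstream.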
    

    I suggest checking whether the missing values are intended or a mistake. If the values really are missing, you can bypass the built-in clustering by pre-computing the linkage and passing it to the function, for example using correlation below:

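    import scipy.spatial as sp, scipy.cluster.hierarchy as hc
    
    # Pivot to wide form first: the correlations are computed on the Phenotype x Gene table
    wide = pd.pivot(df1, columns="Gene", values="P", index="Phenotype")
    
    # DataFrame.corr() excludes NaN pairwise, so the distances stay finite
    row_dism = 1 - wide.T.corr()
    # checks=False skips the exact zero-diagonal test, which rounding error can trip
    row_linkage = hc.linkage(sp.distance.squareform(row_dism, checks=False), method='complete')
    col_dism = 1 - wide.corr()
    col_linkage = hc.linkage(sp.distance.squareform(col_dism, checks=False), method='complete')
    
    sns.clustermap(wide,figsize=(5, 5),row_linkage=row_linkage, col_linkage=col_linkage)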
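
If many rows or columns have gaps, the correlation itself can come out as NaN (for instance when two rows share too few observations), and the linkage call would then fail with the same error. A minimal sketch of a guard, assuming the data has already been pivoted to wide form; `corr_linkage` is a hypothetical helper name, and the example data is seeded purely for illustration:

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform

# Illustrative data: 4 phenotypes x 4 genes with one observation dropped
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Phenotype": np.repeat(["very not cool", "not cool", "very cool", "super cool"], 4),
    "Gene": ["Gene" + str(i) for i in range(4)] * 4,
    "P": rng.uniform(0, 1, 16),
})
wide = pd.pivot(df[:15], index="Phenotype", columns="Gene", values="P")


def corr_linkage(frame, method="complete"):
    """Hypothetical helper: linkage over the columns of `frame` using 1 - Pearson correlation."""
    dism = (1 - frame.corr()).to_numpy()  # corr() excludes NaN pairwise
    if not np.isfinite(dism).all():
        raise ValueError("correlation distance is not finite; too few shared observations")
    # checks=False skips squareform's exact zero-diagonal test
    return hc.linkage(squareform(dism, checks=False), method=method)


row_linkage = corr_linkage(wide.T)  # columns of wide.T are the rows of wide
col_linkage = corr_linkage(wide)
print(row_linkage.shape, col_linkage.shape)
```

The resulting arrays can be passed to sns.clustermap through row_linkage and col_linkage in the same way as above.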
    
