Search code examples
pythonpandasmatrixpairwise

Convert long-form dataframe of pairwise distances to distance matrix in python


I have a pandas dataframe of pairwise distances in the form of:

    SampleA   SampleB  Num_Differences
0  sample_1  sample_2                1
1  sample_1  sample_3                4
2  sample_2  sample_3                8

Note that there are no self-self comparisons (e.g., sample_1 vs sample_1 won't be represented). I would like to convert this table into a squareform distance matrix instead, like so:

            sample_1      sample_2  sample_3
sample_1                       1              4
sample_2         1                            8
sample_3         4             8    

Can anyone give me some pointers on how to do such a conversion in python? The problem is analogous to a previous question in R (Converting pairwise distances into a distance matrix in R), but I don't know the corresponding python functions to use. The problem also appears to be the opposite of this question (Convert a distance matrix to a list of pairwise distances in Python).

Some code to reproduce a dataframe in the form I'm using:

df = pd.DataFrame([['sample_1', 'sample_2', 1],
                   ['sample_1', 'sample_3', 4],
                   ['sample_2', 'sample_3', 8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])

Solution

  • You can reshape to square, and then make symmetrical by adding the transposed values:

    # make unique, sorted, common index
    idx = sorted(set(df['SampleA']).union(df['SampleB']))
    
    # reshape
    (df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
       .reindex(index=idx, columns=idx)
       .fillna(0, downcast='infer')
       .pipe(lambda x: x+x.values.T)
     )
    

    Alternatively, you can use ordered categorical indexes and keep NAs during reshaping with pivot_table. Then add the transposed values to make symmetrical:

    cat = sorted(set(df['SampleA']).union(df['SampleB']))
    
    (df.assign(SampleA=pd.Categorical(df['SampleA'],
                                      categories=cat,
                                      ordered=True),
               SampleB=pd.Categorical(df['SampleB'],
                                      categories=cat,
                                      ordered=True),
               )
        .pivot_table(index='SampleA',
                     columns='SampleB',
                     values='Num_Differences',
                     dropna=False, fill_value=0)
        .pipe(lambda x: x+x.values.T)
    )
    

    Output:

    SampleB   sample_1  sample_2  sample_3
    SampleA                               
    sample_1         0         1         4
    sample_2         1         0         8
    sample_3         4         8         0