Search code examples
pythonnumpymatlabstatistics

p-values for all pairs between two matrices to achieve matlab's corr function


I have been trying to implement in Python (with numpy and scipy) this variant of Matlab's corr function, but it seems I cannot solve it by myself. What I need is to implement the alternative Matlab corr implementation:

[rho,pval] = corr(X,Y)

I would appreciate any help on this!!!

What I tried:

I tried to modify the solutions posted here and in this other thread, without much success. For instance, I was able to hstack the two matrixes X and Y, and by keeping a part of the correlation matrix, I got the right result for the correlations. However, the same trick is NOT working for the p-values. Actually, both the solutions in the other threads gave me the (more or less) correct values for the correlation coefficients, which is great, but I cannot reproduce the behavior of the Matlab implementation for the p-values.

Also, the solution here aims to reproduce corrcoef's behaviour, which, according to Matlab's documentation, converts the input matrices into column vectors before computing the correlations.

On the other hand, I also tried to hstack the matrices in Matlab, and again got the same answer for the correlations, but the values are quite different for the p-values between what I get in Python and what I got in Matlab. This made me think that the problem perhaps is in the statistic being calculated. However, according to the Matlab documentation, it uses:

corr computes the p-values for Pearson's correlation using a Student's t distribution for a transformation of the correlation

and, from the documentation in SciPy I think they are using the same test, but I am not 100% sure, given that the bibligraphic reference there is for Student's paper, which is the one that Matlab's docs say it uses (Student's r), but as I said, I am NOT sure at all.


Solution

  • I think the easiest way is to do it in pandas, using scipy.stats pearsonr, which returns the pairwise rho and pval. I tested with some sample below, and I believe the results match the matlab results.

    import numpy as np
    from scipy.stats import pearsonr
    import pandas as pd
    
    
    X = np.array([
        [0.5377, 0.3188, 3.5784, 0.7254],
        [1.8339, -1.3077, 2.7694, -0.0631],
        [-2.2588, -0.4336, -1.3499, 0.7147],
        [0.8622, 0.3426, 3.0349, -0.2050]
    ])
    
    Y1 = np.array([
        [-0.1241, 0.6715, 0.4889, 0.2939],
        [1.4897, -1.2075, 1.0347, -0.7873],
        [1.4090, 0.7172, 0.7269, 0.8884],
        [1.4172, 1.6302, -0.3034, -1.1471]
    ])
    
    Y2 = Y1
    Y2[:, 3] = Y2[:, 3] + X[:, 1]
    
    df1 = pd.DataFrame(X)
    df2 = pd.DataFrame(Y2)
    
    coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
    pvalmat = np.zeros((df1.shape[1], df2.shape[1]))
    
    for i in range(df1.shape[1]):
        for j in range(df2.shape[1]):
            corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])
    
            coeffmat[i,j] = corrtest[0]
            pvalmat[i,j] = corrtest[1]
    
    dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
    print(dfcoeff)
    dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
    print(dfpvals)
    

    I also tried playing around with pandas corrwith function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html#pandas.DataFrame.corrwith) but didn't spend enough time. I think there is a more elegant solution which uses that.

    Hope this helps