Search code examples
pythonpandasnumpycorrelationmatrix-multiplication

Modifying a matrix to include various data features


I have the following correlations:

    a     b     c
a   1     0.9   0.8
b   0.9   1     0.7
c   0.8   0.7   1

and the following vectors:

col1    col2    col3
c1      a       5
c2      a       -6
c1      b       10
c2      b       -15
c1      c       24
c2      c       -30

The aim is to multiply the 'col3' vector vs the correlation matrix matching elements per 'col2', all whilst bearing in mind that an extra weight of 0.7 is to be added when 'col1' differs:

enter image description here

blue Corr(c1a,c1a): 25 = 5 * 5 * 1

green Corr(c1a,c2a): -21 = 5 * -6 * 1 * .7

yellow Corr(c1a,c1b): 45 = 5 * 10 * .9

red Corr(c1a,c2b): -47.25 = 5 * -15 * .9 * .7

'col1' won't always differ (there are cases when it will always be 'c1' so the extra 0.7 weight won't be needed and the problem will just be straight matrix multiplication)

Is there a more pythonic way to express the above?

My initial idea was to expand the correlation matrix as follows:

enter image description here

blue: 1

green: 0.7 = 1 * 0.7

yellow: 0.9

red: 0.63 = 0.9 * 0.7

and then do the aforementioned operation via einsum or tensordot with the vectors. (But I'm struggling setting this up dynamically for other vector scenarios) Perhaps there is a better way to approach this. Any other suggestions would be great!

SET UP:

import pandas as pd

corr = pd.DataFrame({'a':[1,.9,.8],
                     'b':[.9,1,.7],
                     'c':[.8,.7,1]},
                     index = ['a','b','c'])

inputs = pd.DataFrame({'col1':['c1','c2','c1','c2','c1','c2'],
                       'col2':['a','a','b','b','c','c'],
                       'col3':[5,-6,10,-15,24,-30]})

Solution

  • You could try with a dot product and a conditional multiply:

    Note that your corr matrix in the setup is missing the index labels hence this additional step, you can ignore this if the corr matrix is what you show in the image.

    corr1 = corr.set_axis(corr.columns,inplace=False)
    

    Solution:

    u = inputs.set_index(['col1','col2'])
    v = u.dot(u.T).mul(corr1,level=1)
    cond = (v.index.get_level_values(0).to_numpy()[:,None] 
            != v.columns.get_level_values(0).to_numpy()) #Credits wwii
    #cond = (np.array(v.index.get_level_values(0))[:,None] 
    #        != np.array(v.columns.get_level_values(0))[:,None].T)
    out = v.mask(cond,v*0.7)
    

    print(out)
    
    col1          c1      c2     c1      c2      c1     c2
    col2           a       a      b       b       c      c
    col1 col2                                             
    c1   a     25.00  -21.00   45.0  -47.25   96.00  -84.0
    c2   a    -21.00   36.00  -37.8   81.00  -80.64  144.0
    c1   b     45.00  -37.80  100.0 -105.00  168.00 -147.0
    c2   b    -47.25   81.00 -105.0  225.00 -176.40  315.0
    c1   c     96.00  -80.64  168.0 -176.40  576.00 -504.0
    c2   c    -84.00  144.00 -147.0  315.00 -504.00  900.0