Search code examples
pythonnumpydataframemulti-indexquadratic

Quadratic n term equation using multiindex


I have two DFs which I would like to use to calculate the following:

w(ti,ti)*a(ti)^2 + w(tj,tj)*b(sj,tj)^2 + 2*w(si,tj)*a(ti)*b(tj)

The above uses two terms (a,b). w is the weight df where i and j are index and column spaces pertaining to the Tn index of a and b.

Set Up - Edit dynamic W

import pandas as pd
import numpy as np

I = ['i'+ str(i) for i in range(4)]
Q = ['q' + str(i) for i in range(5)]
T = ['t' + str(i) for i in range(3)]
n = 100

df1 = pd.DataFrame({'I': [I[np.random.randint(len(I))] for i in range(n)],
                    'Q': [Q[np.random.randint(len(Q))] for i in range(n)],
                    'Tn': [T[np.random.randint(len(T))] for i in range(n)],
                    'V': np.random.rand(n)}).groupby(['I','Q','Tn']).sum()

df1.head(5)
I  Q  Tn  V        
i0 q0 t0  1.626799
      t2  1.725374
   q1 t0  2.155340
      t1  0.479741
      t2  1.039178

w = np.random.randn(len(T),len(T))
w = (w*w.T)/2
np.fill_diagonal(w,1)
W = pd.DataFrame(w, columns = T, index = T)

W
          t0        t1        t2
t0  1.000000  0.029174 -0.045754
t1  0.029174  1.000000  0.233330
t2 -0.045754  0.233330  1.000000

Effectively I would like to use the index Tn in df1 to use the above equation for every I and Q.

The end result for df1.loc['i0','q0'] in the example above should be:

  W(t0,t0) * V(t0)^2 
+ W(t2,t2) * V(t2)^2
+ 2 * W(t0,t2) * V(t0) * V(t2) 
=     
  1.0 * 1.626799**2 
+ 1.0 * 1.725374**2 
+ (-0.045754) * 1.626799 * 1.725374

The end result for df1.loc['i0','q1'] in the example above should be:

  W(t0,t0) * V(t0)^2 
+ W(t1,t1) * V(t1)^2
+ W(t2,t2) * V(t2)^2
+ 2 * W(t0,t1) * V(t0) * V(t1)
+ 2 * W(t0,t2) * V(t0) * V(t2)
+ 2 * W(t2,t1) * V(t1) * V(t2)
=     
  1.0 * 2.155340**2 
+ 1.0 * 0.479741**2
+ 1.0 * 1.039178**2
+ 0.029174 * 2.155340 * 0.479741 * 1
+ (-0.045754) * 2.155340 * 1.039178 * 1
+ 0.233330 * 0.479741 * 1.039178 * 1

This pattern will repeat depending on the number of tn terms in each Q hence it should be robust enough to handle as many Tn terms as needed (in the example I use 3, but it could be as much as 100 or more).

Each result should then be saved in a new DF with Index = [I, Q] The solution should also not be slower than excel when n increases in value.

Thanks in advance


Solution

  • One way could be first reindex your dataframe df1 with all the possible combinations of the lists I, Q and Tn with pd.MultiIndex.from_product, filling the missing value in the column 'V' with 0. The column has then len(I)*len(Q)*len(T) elements. Then you can reshape the values to get each row related to one combination on I and Q such as:

    ar = (df1.reindex(pd.MultiIndex.from_product([I,Q,T], names=['I','Q','Tn']),fill_value=0)
             .values.reshape(-1,len(T)))
    

    To see the relation between my input df1 and ar, here are some related rows

    print (df1.head(6))
                     V
    I  Q  Tn          
    i0 q0 t1  1.123666
       q1 t0  0.538610
          t1  2.943206
       q2 t0  0.570990
          t1  0.617524
          t2  1.413926
    print (ar[:3])
    [[0.         1.1236656  0.        ]
     [0.53861027 2.94320574 0.        ]
     [0.57099049 0.61752408 1.4139263 ]]
    

    Now, to perform the multiplication with the element of W, one way is to create the outer product of ar with itself but row-wise to get, for each row a len(T)*len(T) matrix. For example, for the second row:

    [0.53861027 2.94320574 0.        ]
    

    becomes

    [[0.29010102, 1.58524083, 0.        ], #0.29010102 = 0.53861027**2, 1.58524083 = 0.53861027*2.94320574 ...
     [1.58524083, 8.66246003, 0.        ],
     [0.        , 0.        , 0.        ]]
    

    Several methods are possible such as ar[:,:,None]*ar[:,None,:] or np.einsum with the right subscript: np.einsum('ij,ik->ijk',ar,ar). Both give same result.

    The next step can be done with a tensordot and specify the right axes. So with ar and W as an input, you do:

    print (np.tensordot(np.einsum('ij,ik->ijk',ar,ar),W.values,axes=([1,2],[0,1])))
    array([ 1.26262437, 15.29352438, 15.94605435, ...
    

    To check for the second value here, 1*0.29010102 + 1*8.66246003 + 2.*2*1.58524083 == 15.29352438 (where 1 is W(t0,t0) and W(t1,t1), 2 is W(t0,t1))

    Finally, to create the dataframe as expected, use again pd.MultiIndex.from_product:

    new_df = pd.DataFrame({'col1': np.tensordot(np.einsum('ij,ik->ijk',ar,ar),
                                                W.values,axes=([1,2],[0,1]))},
                          index=pd.MultiIndex.from_product([I,Q], names=['I','Q']))
    
    print (new_df.head(3))
                col1
    I  Q            
    i0 q0   1.262624
       q1  15.293524
       q2  15.946054
    ...
    

    Note: if you are SURE that each element of T is at least once in the last level of df1, the ar can be obtain using unstack such as ar=df1.unstack(fill_value=0).values. But I would suggest to use the reindex method above to prevent any error