I have some small symmetric matrices that are low-dimensional representations of larger symmetric matrices, and a key vector indicating which cells of the high-dimensional matrix correspond to which cells of the low-dimensional matrix.
I would like to recreate the larger matrices by populating each cell of the large matrix with its corresponding value from the low-dimensional matrix. I believe there should be a vectorized approach to this, but so far all I've come up with is a simple nested for loop, which is prohibitively slow for these matrices (10k+ rows and columns).
In this toy example, the key is vec1, the low-D matrix is source_mat, and the high-D matrix is target_mat. I need to create target_mat so that each cell is filled with the corresponding value from source_mat according to the key.
import pandas as pd
import numpy as np
import random

vec1 = []
for x in range(0, 100):
    vec1.append(random.randint(0, 19))  # creating the key
vec1 = pd.DataFrame(vec1)

sizevec1 = vec1.shape[0]
matshape = (sizevec1, sizevec1)
target_mat = np.zeros(matshape)  # key and target have the same shape
target_mat = pd.DataFrame(target_mat)

temp = np.random.random((20, 20))
source_mat = temp * temp.T  # element-wise product, so source_mat is symmetric

for row in range(target_mat.shape[0]):
    for column in range(target_mat.shape[1]):
        print('row is', row)
        print('column is', column)
        target_mat.iloc[row, column] = source_mat.item(int(vec1.iloc[row, 0]), int(vec1.iloc[column, 0]))
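For what it's worth, the whole mapping can be expressed in a single NumPy fancy-indexing step. A minimal sketch, assuming the key is kept as a plain integer array rather than a DataFrame (names here are illustrative, not the ones from the code above):

```python
import numpy as np

rng = np.random.default_rng(0)
key = rng.integers(0, 20, size=100)  # maps each target row/col to a source row/col
temp = rng.random((20, 20))
source = temp * temp.T               # element-wise product, so source is symmetric

# np.ix_ broadcasts the key over both axes in one shot:
# target[i, j] == source[key[i], key[j]]
target = source[np.ix_(key, key)]
```

The equivalent `source[key][:, key]` works too; `np.ix_` just does both axes in one indexing operation.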
Below are two separate updates to the code that led to pretty dramatic speedups.
First, I figured out a vectorized solution, so the calculation is now done in one step. This remains the fastest method even after the second change.
Second, I changed all pandas DataFrames to NumPy arrays. This change had the greatest impact on the for-loop code, which now runs orders of magnitude faster.
The code below runs all three methods: the 'slow', the 'fast', and the 'Xu Mackenzie', named for the friends who thought up the vectorized solution ;-P
# Initialize variables
import time
import random
import numpy as np

n = 13000  # size of the large target matrix
k = 2000   # size of the small source matrix

vec1 = []
for x in range(0, n):
    vec1.append(random.randint(0, k - 1))
vec1 = np.array(vec1)

temp = np.random.random((k, k))
source_mat = temp * temp.T  # element-wise product, so source_mat is symmetric

sizevec1 = len(vec1)
matshape = (sizevec1, sizevec1)
target_mat = np.zeros(matshape)
transform_mat = np.zeros((len(source_mat), len(target_mat)))
# 'Slow' method: nested loop over every cell of the target matrix
matrixtime = time.time()
for row in range(target_mat.shape[0]):
    for column in range(target_mat.shape[1]):
        target_mat[row, column] = source_mat.item(int(vec1[row]), int(vec1[column]))
print(time.time() - matrixtime)
target_mat_slow = target_mat
# 'Xu Mackenzie' method: build a one-hot transform matrix, then two matrix products
target_mat = np.zeros(matshape)
matrixtime = time.time()
for i in range(len(target_mat)):
    transform_mat[vec1[i], i] = 1
temp = np.dot(source_mat, transform_mat)
target_mat = np.dot(temp.T, transform_mat)
target_mat_XM = target_mat
target_mat = np.zeros(matshape)
XM_time = time.time() - matrixtime
print(XM_time)
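To see why the transform-matrix trick works, here is a tiny self-contained example with made-up numbers (the names are illustrative): the one-hot matrix T has T[key[i], i] = 1, so source @ T gathers columns by the key, and one more product with T gathers the rows, giving the same result as direct fancy indexing when the source is symmetric.

```python
import numpy as np

key = np.array([0, 2, 1, 2])   # 4 target cells mapped onto a 3x3 source
src = np.arange(9.0).reshape(3, 3)
src = (src + src.T) / 2        # symmetrize

# One-hot transform matrix: T[key[i], i] = 1
T = np.zeros((3, 4))
T[key, np.arange(4)] = 1

# source -> target via two matrix products
tgt = np.dot(np.dot(src, T).T, T)

# Same result as direct fancy indexing
assert np.array_equal(tgt, src[np.ix_(key, key)])
```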
# 'Fast' method: loop over the small source matrix instead of the large target
matrixtime = time.time()
for row in range(source_mat.shape[0]):
    print('row is', row)
    for column in range(source_mat.shape[1]):  # full range so the diagonal and both triangles get filled
        rowmatch = np.array([vec1 == row]) * 1
        colmatch = np.array([vec1 == column]) * 1
        match_matrix = rowmatch * colmatch.T
        target_mat = target_mat + (match_matrix * source_mat[row, column])
print(time.time() - matrixtime)
target_mat_fast = target_mat

# Verify that all three methods give the same result
print(np.array_equal(target_mat_slow, target_mat_fast))
print(np.array_equal(target_mat_fast, target_mat_XM))