I have some small symmetric matrices that are low-dimensional representations of larger symmetric matrices, and a key vector indicating which cells of the high-dimensional matrix correspond to which cells of the low-dimensional matrix.
I would like to recreate the larger matrices by populating each cell of the large matrix with its corresponding value from the low-dimensional matrix. I believe there should be a vectorized approach to this, but so far all I've come up with is a simple nested for loop, which is prohibitively slow for these matrices (10k+ rows and columns).
In this toy example, the key is vec1, the low-D matrix is source_mat, and the high-D matrix is target_mat. I need to create target_mat so that each cell is filled with the corresponding value from source_mat according to the key.
import pandas as pd
import numpy as np
import random

vec1 = []
for x in range(0, 100):
    vec1.append(random.randint(0, 19))  # creating the key
vec1 = pd.DataFrame(vec1)

sizevec1 = vec1.shape[0]
matshape = (sizevec1, sizevec1)
target_mat = np.zeros(matshape)  # key and target have the same shape
target_mat = pd.DataFrame(target_mat)

temp = np.random.random((20, 20))
source_mat = temp * temp.T  # element-wise product, so source_mat is symmetric

for row in range(target_mat.shape[0]):
    for column in range(target_mat.shape[1]):
        print('row is', row)
        print('column is', column)
        target_mat.iloc[row, column] = source_mat.item(int(vec1.iloc[row, 0]), int(vec1.iloc[column, 0]))
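For what it's worth, the whole mapping can be expressed in a single NumPy fancy-indexing step. A minimal sketch, assuming the key is kept as a plain integer array rather than a DataFrame (names here are illustrative, not the ones from the code above):

```python
import numpy as np

rng = np.random.default_rng(0)
key = rng.integers(0, 20, size=100)  # maps each target row/col to a source row/col
temp = rng.random((20, 20))
source = temp * temp.T               # element-wise product, so source is symmetric

# np.ix_ broadcasts the key over both axes in one shot:
# target[i, j] == source[key[i], key[j]]
target = source[np.ix_(key, key)]
```

The equivalent `source[key][:, key]` works too; `np.ix_` just does both axes in one indexing operation.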
Below are two separate updates to the code that led to pretty dramatic speedups.
First, I figured out a vectorized solution, so the calculation is now done in one step. This remains the fastest method even after the second change.
Second, I changed all pandas DataFrames to NumPy arrays. This change had the greatest impact on the for-loop code, which now runs orders of magnitude faster.
The code below runs all three methods: the 'slow', the 'fast', and the 'Xu Mackenzie', named for the friends who thought up the vectorized solution ;-P
# Initialize variables
import time
import random
import numpy as np

n = 13000  # size of the large target matrix
k = 2000   # size of the small source matrix

vec1 = []
for x in range(0, n):
    vec1.append(random.randint(0, k - 1))
vec1 = np.array(vec1)

temp = np.random.random((k, k))
source_mat = temp * temp.T  # element-wise product, so source_mat is symmetric

sizevec1 = len(vec1)
matshape = (sizevec1, sizevec1)
target_mat = np.zeros(matshape)
transform_mat = np.zeros((len(source_mat), len(target_mat)))
# 'Slow' method: nested loop over every cell of the target matrix
matrixtime = time.time()
for row in range(target_mat.shape[0]):
    for column in range(target_mat.shape[1]):
        target_mat[row, column] = source_mat.item(int(vec1[row]), int(vec1[column]))
print(time.time() - matrixtime)
target_mat_slow = target_mat
# 'Xu Mackenzie' method: build a one-hot transform matrix, then two matrix products
target_mat = np.zeros(matshape)
matrixtime = time.time()
for i in range(len(target_mat)):
    transform_mat[vec1[i], i] = 1
temp = np.dot(source_mat, transform_mat)
target_mat = np.dot(temp.T, transform_mat)
target_mat_XM = target_mat
target_mat = np.zeros(matshape)
XM_time = time.time() - matrixtime
print(XM_time)
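To see why the transform-matrix trick works, here is a tiny self-contained example with made-up numbers (the names are illustrative): the one-hot matrix T has T[key[i], i] = 1, so source @ T gathers columns by the key, and one more product with T gathers the rows, giving the same result as direct fancy indexing when the source is symmetric.

```python
import numpy as np

key = np.array([0, 2, 1, 2])   # 4 target cells mapped onto a 3x3 source
src = np.arange(9.0).reshape(3, 3)
src = (src + src.T) / 2        # symmetrize

# One-hot transform matrix: T[key[i], i] = 1
T = np.zeros((3, 4))
T[key, np.arange(4)] = 1

# source -> target via two matrix products
tgt = np.dot(np.dot(src, T).T, T)

# Same result as direct fancy indexing
assert np.array_equal(tgt, src[np.ix_(key, key)])
```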
# 'Fast' method: loop over the small source matrix instead of the large target
matrixtime = time.time()
for row in range(source_mat.shape[0]):
    print('row is', row)
    for column in range(source_mat.shape[1]):  # full range so the diagonal and both triangles get filled
        rowmatch = np.array([vec1 == row]) * 1
        colmatch = np.array([vec1 == column]) * 1
        match_matrix = rowmatch * colmatch.T
        target_mat = target_mat + (match_matrix * source_mat[row, column])
print(time.time() - matrixtime)
target_mat_fast = target_mat

# Verify that all three methods give the same result
print(np.array_equal(target_mat_slow, target_mat_fast))
print(np.array_equal(target_mat_fast, target_mat_XM))