Tags: python, matrix, indexing, vectorization, dimensionality-reduction

How to vectorize populating a larger matrix with items of a smaller matrix in Python


I have some small symmetric matrices that are low-dimensional representations of larger symmetric matrices. I also have a vector that acts as a key, showing which cells of the high-D matrix are linked to which cells of the low-D matrix.

I would like to recreate the larger matrices by filling each cell of the larger matrix with its corresponding value from the low-dimensional matrix. I believe there should be a vectorized approach, but so far all I have come up with is a simple nested for loop, which is prohibitively slow for matrices of this size (10k+ rows and columns).

In this toy example, the key is vec1, the low-D matrix is source_mat, and the high-D matrix is target_mat. I need to create target_mat so that each cell is filled with the corresponding value from source_mat according to the key, i.e. target_mat[i, j] = source_mat[vec1[i], vec1[j]].

    import pandas as pd
    import numpy as np
    import random

    vec1=[]
    for x in range (0, 100):
        vec1.append(random.randint(0, 19)) #creating the key

    vec1=pd.DataFrame(vec1)
    sizevec1=vec1.shape[0]
    matshape=(sizevec1,sizevec1)
    target_mat=np.zeros(matshape) #key and target have same shape
    target_mat=pd.DataFrame(target_mat)

    temp=np.random.random((20,20))
    source_mat=temp*temp.T

    for row in range(0,target_mat.shape[0]):
        for column in range(0,target_mat.shape[1]):
            print('row is ', row)
            print('column is', column)
            # look up the low-D cell that the key assigns to this high-D cell
            target_mat.iloc[row,column] = source_mat.item(int(vec1.iloc[row,0]), int(vec1.iloc[column,0]))

Solution

  • Below are two separate updates to the code that led to dramatic speedups.

    First, I figured out the vectorized solution, so the calculation is now done in one step. This remains the fastest method even after the second change.

    Second, I changed all pandas DataFrames to NumPy arrays. This change had the greatest impact on the for-loop code, which now runs orders of magnitude faster.

    The code below runs all three methods: the 'slow', the 'fast', and the 'Xu Mackenzie', named for the friends who thought up the vectorized solution ;-P

    #Initialize Variables

    import time
    import random
    import pandas as pd
    import numpy as np
    
    n=13000
    k=2000
    i=0
    vec1=[]
    for x in range(0, n):
       vec1.append(random.randint(0, k-1))
    
    temp=np.random.random((k,k))
    #vec1=pd.DataFrame(vec1)
    vec1=np.array(vec1)
    #vec=pd.DataFrame(np.arange(0,300))
    #vec2=pd.concat([vec,vec1], axis=1)
    #sizevec1=vec1.shape[0]
    sizevec1=len(vec1)
    matshape=(sizevec1,sizevec1)
    target_mat=np.zeros(matshape)
    #target_mat=pd.DataFrame(target_mat)
    
    
    source_mat=temp*temp.T
    transform_mat=np.zeros((len(source_mat),len(target_mat)))
    

    Slow Solution

    matrixtime = time.time()
    for row in range(0,target_mat.shape[0]):
        #print 'row is ', row
        for column in range(0,target_mat.shape[1]):
    
            #print 'column is', column
            target_mat[row,column] = source_mat.item(int(vec1[row]), int(vec1[column]))
    print((time.time() - matrixtime))
    target_mat_slow=target_mat
    target_mat=np.zeros(matshape)
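The loop above is the same nested loop as in the question; the difference is that it only touches NumPy arrays. A minimal sketch of that conversion, assuming the key starts out as a one-column DataFrame (the names `key_df` and `key` and the toy sizes are illustrative, not from the original code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, k = 50, 5                                        # toy sizes, chosen only for illustration
key_df = pd.DataFrame(rng.integers(0, k, size=n))   # key stored as a one-column DataFrame
temp = rng.random((k, k))
source_mat = temp * temp.T                          # symmetric low-D matrix (elementwise product)

# Convert once, outside the loop, so the inner loop never goes through .iloc.
key = key_df.to_numpy().ravel()                     # 1-D integer array of length n
target_mat = np.zeros((n, n))

for row in range(n):
    for column in range(n):
        target_mat[row, column] = source_mat[key[row], key[column]]
```

Each `.iloc` lookup carries pandas indexing overhead, and the original loop pays it several times per cell, so moving to plain arrays is a plausible explanation for the large speedup reported here.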
    

    XU MACKENZIE SOLUTION

    matrixtime = time.time()
    
    for i in range(0,len(target_mat)):
      transform_mat[vec1[i],i]=1
    
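    # target = T^T * S * T, with T = transform_mat (one-hot columns) and S = source_mat;
    # using temp.T for T^T * S relies on source_mat being symmetric
    temp=np.dot(source_mat,transform_mat)
    target_mat=np.dot(temp.T,transform_mat)
    XM_time= time.time() - matrixtime
    print(XM_time)
    target_mat_XM=target_mat
    target_mat=np.zeros(matshape)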
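The Xu Mackenzie approach amounts to building a one-hot matrix T (column i has a single 1 in row vec1[i]) and computing T^T S T, whose (i, j) entry is S[vec1[i], vec1[j]]. A small sketch of this, assuming toy sizes and filling the one-hot matrix with a single indexed assignment instead of a loop, checked against the direct definition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 20                          # toy sizes, chosen only for illustration
vec1 = rng.integers(0, k, size=n)       # key: one low-D index per high-D row/column
temp = rng.random((k, k))
source_mat = temp * temp.T              # symmetric low-D matrix

# One-hot transform matrix: column i has a single 1 in row vec1[i].
transform_mat = np.zeros((k, n))
transform_mat[vec1, np.arange(n)] = 1

# (T^T S T)[i, j] picks out source_mat[vec1[i], vec1[j]].
target_mat = transform_mat.T @ source_mat @ transform_mat

# Reference result built cell by cell from the definition.
expected = np.array([[source_mat[vec1[i], vec1[j]] for j in range(n)]
                     for i in range(n)])
print(np.allclose(target_mat, expected))
```

Written as `T.T @ S @ T`, the product does not depend on the source matrix being symmetric.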
    

    Previous 'fast' solution

    matrixtime = time.time()
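    for row in range(0,source_mat.shape[0]):
        print('row is ', row)
        # loop over every column; range(0, row) would skip the diagonal and upper triangle of source_mat
        for column in range(0, source_mat.shape[1]):
            rowmatch = np.array([vec1==row])      # shape (1, n): cells keyed to this source row
            rowmatch = rowmatch*1
    
            colmatch = np.array([vec1==column])   # shape (1, n): cells keyed to this source column
            colmatch = colmatch*1
    
            match_matrix=rowmatch*colmatch.T      # (n, n) mask via broadcasting
            target_mat=target_mat+(match_matrix*source_mat[row,column])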
    
    print((time.time() - matrixtime))
    target_mat_fast=target_mat
    target_mat=np.zeros(matshape)
    

    TEST FOR EQUIVALENCE

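    # reduce each comparison to a single True/False instead of an n x n boolean array
    print(np.allclose(target_mat_slow, target_mat_fast))
    print(np.allclose(target_mat_fast, target_mat_XM))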
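As a possible alternative to all three methods (not part of the original answer), NumPy's integer-array indexing can express the lookup target_mat[i, j] = source_mat[vec1[i], vec1[j]] directly. A minimal sketch, assuming the key is a 1-D integer array and using toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 20                          # toy sizes, chosen only for illustration
vec1 = rng.integers(0, k, size=n)       # key as a 1-D integer array
temp = rng.random((k, k))
source_mat = temp * temp.T              # symmetric low-D matrix

# np.ix_ builds an open mesh, so entry (i, j) is source_mat[vec1[i], vec1[j]].
target_mat = source_mat[np.ix_(vec1, vec1)]

# Equivalent spelling with explicit broadcasting of the two index arrays.
same_result = source_mat[vec1[:, None], vec1]
print(np.array_equal(target_mat, same_result))
```

This avoids allocating the k x n transform matrix and performs no arithmetic on the values, so it is worth timing against the matrix-product version at the full 13000 x 13000 size.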