python pandas numpy sparse-matrix

Parallelize populating ndarray from pandas series and csr matrix


I am currently using a for loop to populate an ndarray with values from a pandas Series (category/object dtype) and a CSR matrix (scipy.sparse), and I am looking to speed things up.

What I have tried so far: a sequential for loop (works), numba (does not handle Series or strings), joblib (slower than the sequential loop), and swifter.apply (it does parallelize, but is much slower because I have to go through pandas).

import pandas as pd
import numpy as np
from scipy.sparse import rand

nr_matches = 10**5
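# pd.util.testing.rands_array is unavailable in recent pandas releases, so build random 10-character names with NumPy
name_vector = pd.Series(["".join(chars) for chars in np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), size=(nr_matches, 10))])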
matches = rand(nr_matches, 10, density = 0.2, format = 'csr')
non_zeros = matches.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]

left_side = np.empty([nr_matches], dtype = object)
right_side = np.empty([nr_matches], dtype = object)
similarity = np.zeros(nr_matches)

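# Note: matches stores nr_matches * 10 * 0.2 non-zero values, so this loop only covers the first nr_matches of them
for index in range(0, nr_matches):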
    left_side[index] = name_vector.iat[sparserows[index]]
    right_side[index] = name_vector.iat[sparsecols[index]]
    similarity[index] = matches.data[index]

There are no error messages, but the loop is slow because it runs on a single thread.


Solution

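  • As Divarak mentioned, indexing directly with the row and column index arrays works, so no loop is needed (matches_df is assumed here to be an existing pandas DataFrame):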

    matches_df["left_side"] = name_vector.iloc[sparserows].values
    matches_df["right_side"] = name_vector.iloc[sparsecols].values
    matches_df["similarity"] = matches.data
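
For reference, the indexing approach above can be put together as a self-contained sketch. The assumptions here are not from the original post: the random names are generated with NumPy as stand-in data, the similarity matrix is made square so that both row and column indices are valid positions in name_vector, the sizes are kept small, and the CSR matrix is converted to COO format so that the row, column and data arrays are guaranteed to line up one-to-one.

```python
import numpy as np
import pandas as pd
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
n_names = 1_000  # hypothetical size, kept small so the sketch runs quickly

# Stand-in data: random 10-letter lowercase strings instead of real names
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
name_vector = pd.Series(
    ["".join(rng.choice(letters, size=10)) for _ in range(n_names)]
)

# Assumed square similarity matrix (names x names) in CSR format
matches = sparse_random(n_names, n_names, density=0.01, format="csr")

# COO format exposes aligned row, col and data arrays, one entry per stored value
coo = matches.tocoo()

# Integer-array indexing pulls all names in one vectorized step, no Python loop
names = name_vector.to_numpy()
matches_df = pd.DataFrame(
    {
        "left_side": names[coo.row],
        "right_side": names[coo.col],
        "similarity": coo.data,
    }
)
```

The speed-up comes from replacing one Python-level assignment per match with three array operations, rather than from using more threads. matches_df ends up with one row per stored value of the sparse matrix.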