python, sparse-matrix

Efficient way to populate a sparse matrix in Python


I am trying to set up a sparse matrix (dok_matrix) of journal co-occurrences. Unfortunately, my solution is too inefficient to be of any use, and I couldn't find a better one online.

EDIT: I would also like to create the sparse matrix directly, not by first creating a dense matrix and then turning it into a sparse matrix.

I start with a dataframe of how often certain journals are cited together. In this example, Nature and Science are cited together 3 times. I would like to end up with a sparse, symmetric matrix where the rows and columns are journals and the non-empty entries are how often these journals are cited together. I.e., here the full matrix would have four rows (Lancet, Nature, NEJM, Science) and four columns (Lancet, Nature, NEJM, Science) and three non-zero co-citation counts, one per journal pair. Since my real data is much larger, I would like to use a sparse matrix representation.

What I currently do in my code is update the non-zero entries with the values from my DataFrame. Unfortunately, comparing journal names for every pair of journals is quite time-consuming, and my question is whether there is a quicker way of setting up a sparse matrix here.

My understanding is that my dataframe is close to a dok_matrix anyway, with the journal combination being equivalent to the (row, column) tuple used as a key in the dok_matrix. However, I do not know how to make this transformation.

Any help is appreciated!

# Import packages
import pandas as pd
from scipy.sparse import dok_matrix

# Set up dataframe
d = {'journal_comb': ['Nature//// Science', 'NEJM//// Nature', 'Lancet//// NEJM'], 'no_combs': [3, 5, 6], 'journal_1': ['Nature', 'NEJM', 'Lancet'], 'journal_2': ['Science', 'Nature', 'NEJM']}
df = pd.DataFrame(d)

# Create list of all journal titles
journal_list = sorted(set(df['journal_1']) | set(df['journal_2']))

# Set up empty sparse matrix with final size
S = dok_matrix((len(journal_list), len(journal_list)))

# Loop over all pairs of journal titles and look up the count for co-occurring journals
# Update sparse matrix value with value from DataFrame
for i in range(len(journal_list)):
    print(i)
    # Check whether journal name is actually in column 'journal_1'
    if len(df[(df['journal_1'] == journal_list[i])]) > 0:
        for j in range(len(journal_list)):
            # Select the count for this pair; the series is empty if the journals are not co-cited
            match = df[(df['journal_1'] == journal_list[i]) & (df['journal_2'] == journal_list[j])]['no_combs']
            if len(match) == 1:
                # Update value in sparse matrix
                S[i, j] = match.iloc[0]

Solution

  • Use pandas first to shape your matrix:

    swapped = df.rename(columns={'journal_1': 'journal_2', 'journal_2': 'journal_1'})
    dense = pd.concat([df, swapped]).pivot(index='journal_1', columns='journal_2', values='no_combs').fillna(0)
    S = dok_matrix(dense.to_numpy())
    

    I first append a copy of the dataframe with journal_1 and journal_2 swapped (so the result is symmetric), then pivot to get a journal-by-journal table, fill the pairs that never co-occur with 0, convert that to a NumPy array, and finally to a dok_matrix. Note that this builds a dense array along the way.
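
Since the pivot route goes through a dense array, here is a minimal sketch of one way to build the sparse matrix directly instead. It is not the answer's method: it swaps in scipy's coordinate format (coo_matrix), which takes the values together with their row and column indices, and converts to a dok_matrix only at the end. The names `journals`, `index`, `rows`, `cols` and `upper` are made up for illustration, and the mirroring step assumes each journal pair appears only once in the dataframe and that no journal is paired with itself.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Same toy data as in the question (journal_comb is not needed here)
df = pd.DataFrame({
    'no_combs': [3, 5, 6],
    'journal_1': ['Nature', 'NEJM', 'Lancet'],
    'journal_2': ['Science', 'Nature', 'NEJM'],
})

# Sorted list of all journal titles; a title's position is its matrix index
journals = sorted(set(df['journal_1']) | set(df['journal_2']))
index = {name: i for i, name in enumerate(journals)}

# Translate titles to integer indices with one dictionary lookup per row,
# instead of filtering the whole dataframe for every pair of journals
rows = df['journal_1'].map(index).to_numpy()
cols = df['journal_2'].map(index).to_numpy()
vals = df['no_combs'].to_numpy()

# Build the matrix from (value, (row, col)) triples; no dense array is created
n = len(journals)
upper = coo_matrix((vals, (rows, cols)), shape=(n, n))

# Mirror the entries to make the matrix symmetric, then convert to DOK
# (assumption: each pair is listed once and there are no self-pairs)
S = (upper + upper.T).todok()
```

With this layout, `S[index['Nature'], index['Science']]` looks up the co-citation count for a pair of titles. If the matrix is only going to be used for arithmetic rather than for incremental updates, stopping at the CSR result of `upper + upper.T` may be the better choice.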