numpy, scipy, sparse-matrix

How can I convert a sparse representation in a .txt file to a dense matrix in SciPy?


I have a .txt file from the Epinions data set in a sparse representation (i.e., the line 23 387 5 represents the fact "user 23 has rated item 387 as 5"). I want to convert this sparse format to its dense representation in SciPy so I can do matrix factorization on it.

I loaded the file with loadtxt() from NumPy, which gives a [664824, 3] array. I then passed that array to scipy.sparse.csr_matrix and called todense() on the result, hoping to get the dense format, but I always get back the same shape: [664824, 3]. How can I turn it into the original [40163, 139738] dense representation?

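import numpy as np
from scipy.sparse import csr_matrix

# Each row of d is (user, item, rating); d has shape (664824, 3)
d = np.loadtxt("MFCode/Epinions_dataset.txt")
# csr_matrix(d) stores d itself as the matrix, not as (row, col, value) triples
S = csr_matrix(d)
D = S.todense()  # still (664824, 3)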

I expected a dense matrix with the shape of [40163,139738]


Solution

  • A small sample of CSV-like text:

    In [219]: txt = """0 1 3 
         ...: 1 0 4 
         ...: 2 2 5 
         ...: 0 3 6""".splitlines()                                                 
    In [220]: data = np.loadtxt(txt)                                                
    In [221]: data                                                                  
    Out[221]: 
    array([[0., 1., 3.],
           [1., 0., 4.],
           [2., 2., 5.],
           [0., 3., 6.]])
    

    Using scipy.sparse with the (data, (row, col)) style of input:

    In [222]: from scipy import sparse                                              
    In [223]: M = sparse.coo_matrix((data[:,2], (data[:,0], data[:,1])), shape=(5,4))                                                                     
    In [224]: M                                                                     
    Out[224]: 
    <5x4 sparse matrix of type '<class 'numpy.float64'>'
        with 4 stored elements in COOrdinate format>
    In [225]: M.A                                                                   
    Out[225]: 
    array([[0., 3., 0., 6.],
           [4., 0., 0., 0.],
           [0., 0., 5., 0.],
           [0., 0., 0., 0.],
           [0., 0., 0., 0.]])
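
    The same construction can be wrapped in a small helper that reads the shape off the largest indices instead of hard-coding it. This is only a sketch: triples_to_coo and its one_based flag are illustrative names, not part of SciPy, and whether the IDs in the Epinions file start at 0 or 1 is an assumption to check against the data.

```python
import numpy as np
from scipy import sparse


def triples_to_coo(triples, one_based=False):
    """Build a COO matrix from rows of (row_id, col_id, value).

    Illustrative helper, not a SciPy function.
    """
    triples = np.asarray(triples)
    rows = triples[:, 0].astype(np.int64)
    cols = triples[:, 1].astype(np.int64)
    vals = triples[:, 2]
    if one_based:
        # Assumption: IDs in the file start at 1, so shift them to 0-based.
        rows = rows - 1
        cols = cols - 1
    # Smallest shape that holds every index seen in the data.
    shape = (int(rows.max()) + 1, int(cols.max()) + 1)
    return sparse.coo_matrix((vals, (rows, cols)), shape=shape)


# Same sample as above, as an array instead of parsed text.
data = np.array([[0, 1, 3], [1, 0, 4], [2, 2, 5], [0, 3, 6]], dtype=float)
M = triples_to_coo(data)
print(M.shape)      # (3, 4)
print(M.toarray())
```

    If the same (row, col) pair appears more than once, the COO format sums the values when converting to a dense array or to CSR, whereas direct assignment into a zeros array keeps only one of them.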
    

    Alternatively fill in a zeros array directly:

    In [226]: arr = np.zeros((5,4))                                                 
    In [227]: arr[data[:,0].astype(int), data[:,1].astype(int)]=data[:,2]           
    In [228]: arr                                                                   
    Out[228]: 
    array([[0., 3., 0., 6.],
           [4., 0., 0., 0.],
           [0., 0., 5., 0.],
           [0., 0., 0., 0.],
           [0., 0., 0., 0.]])
    

    But beware that np.zeros([40163,139738]) could raise a memory error: a float64 array of that shape needs about 45 GB. M.A (equivalently, M.toarray()) could fail in the same way.
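
    To make the memory concern concrete, here is a rough estimate plus one way to avoid the full dense array: keep the matrix in CSR form and densify only a few rows at a time. This is a sketch under stated assumptions: the shape and rating count are the ones quoted in the question, the CSR estimate assumes float64 values and int32 indices, and the four sample triples stand in for the real file.

```python
import numpy as np
from scipy import sparse

n_users, n_items = 40163, 139738   # shape quoted in the question
n_ratings = 664824                 # rows in the loaded file, per the question

# Dense float64 storage: one 8-byte value per cell.
dense_bytes = n_users * n_items * np.dtype(np.float64).itemsize
print(f"dense: {dense_bytes / 1e9:.1f} GB")   # dense: 44.9 GB

# CSR storage, assuming float64 values and int32 indices:
# one value and one column index per rating, plus one row pointer per row (+1).
csr_bytes = n_ratings * (8 + 4) + (n_users + 1) * 4
print(f"csr:   {csr_bytes / 1e6:.1f} MB")     # csr:   8.1 MB

# Stand-in for the real data: the four sample triples from the answer above.
data = np.array([[0, 1, 3], [1, 0, 4], [2, 2, 5], [0, 3, 6]], dtype=float)
rows = data[:, 0].astype(int)
cols = data[:, 1].astype(int)
S = sparse.coo_matrix((data[:, 2], (rows, cols)),
                      shape=(n_users, n_items)).tocsr()

# Densify only a small block of rows; CSR supports efficient row slicing.
block = S[:2].toarray()
print(block.shape)   # (2, 139738)
```

    Whether a row-block approach fits depends on the factorization routine; some routines accept a sparse matrix directly, in which case no dense conversion is needed at all.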