
Create Sparse Matrix in Python


I'm working with data and would like to create a sparse matrix to use later for clustering.

with open('data', 'r') as fileHandle:
    for line in fileHandle:
        # Split the tab-separated line, dropping the trailing newline
        fields = line.rstrip('\n').split('\t')
        # Keep the term, ids and quantity columns
        json_list = [fields[0], fields[1], fields[3]]

Right now the data looks like this:

term, ids, quantity
['buick', '123,234', '500']
['chevy', '345,456', '300']
['suv','123', '100']

The output I need looks like this:

term, quantity, '123', '234', '345', '456', '567'
buick, 500, 1, 1, 0, 0, 0
chevy, 300, 0, 0, 1, 1, 0
suv,   100, 1, 0, 0, 0, 0

I've tried working with the numpy sparse matrix library, but with no success.


Solution

  • scikit-learn probably has tools to do this easily, but I'll demonstrate a basic Python/numpy solution.

    The raw data is a list of lists:

    In [1150]: data=[['buick', '123,234', '500'],
                     ['chevy', '345,456', '300'],
                     ['suv','123', '100']]
    

    I can pull out the various columns with list comprehensions. This might not be the fastest approach for a very large input, but for now it's an easy way to tackle the problem piece by piece.

    In [1151]: terms=[row[0] for row in data]
    
    In [1152]: terms
    Out[1152]: ['buick', 'chevy', 'suv']
    
    In [1153]: quantities=[int(row[2]) for row in data]
    
    In [1154]: quantities
    Out[1154]: [500, 300, 100]
    

    Create the list of possible ids. I could pull these from the data, but you are apparently using a larger list (your expected output includes 567, which never appears in the sample rows). The ids could also be kept as strings instead of ints.

    In [1155]: idset=[123,234,345,456,567]
    
    In [1156]: ids=[[int(i) for i in row[1].split(',')] for row in data]
    
    In [1157]: ids
    Out[1157]: [[123, 234], [345, 456], [123]]
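
    If the master list is not known in advance, it could be derived from the data itself. This is only a minimal sketch under the same data layout as above; the name idset_from_data is mine and not part of the original session. It only finds ids that actually occur, so 567 would be missing here.

```python
# Sample rows in the same shape as the question: [term, comma-separated ids, quantity]
data = [['buick', '123,234', '500'],
        ['chevy', '345,456', '300'],
        ['suv', '123', '100']]

# Parse each row's id field into a list of ints
ids = [[int(i) for i in row[1].split(',')] for row in data]

# Union of all ids seen in the data, sorted so the column order is reproducible
idset_from_data = sorted({i for row_ids in ids for i in row_ids})
print(idset_from_data)  # [123, 234, 345, 456]
```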
    

    np.in1d is a handy tool for finding where those sublists fit in the master list (np is numpy, imported with import numpy as np). The resulting idM is the feature matrix, with lots of 0s and a few 1s.

    In [1158]: idM=np.array([np.in1d(idset,i) for i in ids],int)
    
    In [1159]: idM
    Out[1159]: 
    array([[1, 1, 0, 0, 0],
           [0, 0, 1, 1, 0],
           [1, 0, 0, 0, 0]])
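
    Newer NumPy releases recommend np.isin over np.in1d. As far as I know it gives the same membership test here, so the step above could be sketched as a standalone script like this (variable names follow the session above):

```python
import numpy as np

idset = [123, 234, 345, 456, 567]          # master list of ids (one column each)
ids = [[123, 234], [345, 456], [123]]      # ids present in each row

# For every row, mark which entries of idset occur in that row's id list
idM = np.array([np.isin(idset, row_ids) for row_ids in ids], dtype=int)
print(idM)
```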
    

    We could assemble the pieces in various ways.

    For example a structured array could be created with:

    In [1161]: M=np.zeros(len(data),dtype='U10,int,(5)int')
    
    In [1162]: M['f0']=terms
    
    In [1163]: M['f1']=quantities
    
    In [1164]: M['f2']=idM
    
    In [1165]: M
    Out[1165]: 
    array([('buick', 500, [1, 1, 0, 0, 0]), ('chevy', 300, [0, 0, 1, 1, 0]),
           ('suv', 100, [1, 0, 0, 0, 0])], 
          dtype=[('f0', '<U10'), ('f1', '<i4'), ('f2', '<i4', (5,))])
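
    To get from such a structured array to the comma-separated layout shown in the question, something like the following might do. It is a sketch only: the field names term, quantity and ids are my own choice (the session above uses the default f0, f1, f2), and the header prints the ids without quotes.

```python
import numpy as np

terms = ['buick', 'chevy', 'suv']
quantities = [500, 300, 100]
idset = [123, 234, 345, 456, 567]
idM = np.array([[1, 1, 0, 0, 0],
                [0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0]])

# Same layout as the structured array above, but with explicit field names
dt = np.dtype([('term', 'U10'), ('quantity', int), ('ids', int, (len(idset),))])
M = np.zeros(len(terms), dtype=dt)
M['term'] = terms
M['quantity'] = quantities
M['ids'] = idM

# Header row, then one comma-separated line per record
header = ', '.join(['term', 'quantity'] + [str(i) for i in idset])
lines = [', '.join([str(rec['term']), str(rec['quantity'])]
                   + [str(flag) for flag in rec['ids']])
         for rec in M]
print(header)
print('\n'.join(lines))
```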
    

    idM could be turned into a sparse matrix with:

    In [1167]: from scipy import sparse
    
    In [1168]: c=sparse.coo_matrix(idM)
    
    In [1169]: c
    Out[1169]: 
    <3x5 sparse matrix of type '<class 'numpy.int32'>'
        with 5 stored elements in COOrdinate format>
    
    In [1170]: c.toarray()
    Out[1170]: 
    array([[1, 1, 0, 0, 0],
           [0, 0, 1, 1, 0],
           [1, 0, 0, 0, 0]])
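
    Since the stated goal is clustering, the quantity column could be attached to the indicator columns and the result converted to CSR, which supports row indexing (COO does not). This is a sketch, not something from the original session; the names q and features are assumptions of mine.

```python
import numpy as np
from scipy import sparse

quantities = [500, 300, 100]
idM = np.array([[1, 1, 0, 0, 0],
                [0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0]])
c = sparse.coo_matrix(idM)

# Quantity as a 3x1 sparse column
q = sparse.coo_matrix(np.array(quantities).reshape(-1, 1))

# Stack [quantity | id indicators] side by side and convert to CSR
features = sparse.hstack([q, c], format='csr')
print(features.shape)      # (3, 6)
print(features.toarray())
```

    The term labels stay in the separate terms list, since a sparse matrix holds only numeric values.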
    

    In this exploration it was easier to create the dense array first and build the sparse matrix from it.

    But sparse also provides a bmat function that lets me build the multi-row matrix from a list of single-row ones. (See my edit history for a version that constructs the coo inputs directly.)

    In [1220]: ll=[[sparse.coo_matrix(np.in1d(idset,i),dtype=int)] for i in ids]
    
    In [1221]: sparse.bmat(ll)
    Out[1221]: 
    <3x5 sparse matrix of type '<class 'numpy.int32'>'
        with 5 stored elements in COOrdinate format>
    
    In [1222]: sparse.bmat(ll).toarray()
    Out[1222]: 
    array([[1, 1, 0, 0, 0],
           [0, 0, 1, 1, 0],
           [1, 0, 0, 0, 0]], dtype=int32)
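
    For a large id list, building the dense rows first may be wasteful. One way to construct the coo inputs directly is sketched below. This is my own illustration, not the version from the edit history, and the helper names col_of, rows, cols and vals are assumptions.

```python
import numpy as np
from scipy import sparse

idset = [123, 234, 345, 456, 567]
ids = [[123, 234], [345, 456], [123]]

# Map each id to its column index in the final matrix
col_of = {id_: j for j, id_ in enumerate(idset)}

# Collect one (row, col) coordinate per id occurrence
rows, cols = [], []
for r, row_ids in enumerate(ids):
    for id_ in row_ids:
        rows.append(r)
        cols.append(col_of[id_])   # raises KeyError for an id missing from idset

vals = np.ones(len(rows), dtype=int)
direct = sparse.coo_matrix((vals, (rows, cols)), shape=(len(ids), len(idset)))
print(direct.toarray())
```

    Duplicate coordinates in COO format are summed on conversion, so an id repeated within one row would show up as 2 rather than 1 unless it is deduplicated first.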