Search code examples
pythonindexingscipysparse-matrix

python: can I have a sparse matrix representation without (explicitly) using integer indices?


I have a dataset that is essentially a sparse binary matrix that represents relationships between elements of two sets. For example, let the 1st set be people (represented by their names), e.g. somehting like this:

people = set(['john','jane','mike','joe'])

and the 2nd set be a bunch of binary attributes, e.g.

attrs = set(['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan'])

The dataset is represented by a tab-separated data file that assigns some of the attributes to each person, e.g.

john    likes_coffee
john    drives_car
john    has_curly_hair
jane    has_curly_hair
jane    man_u_fan
...

attrs has about 30,000 elements, people can be as big 6,000,000, but the data is sparse, i.e. each person has at most 30-40 attributes

I am looking for a data structure/class in python that would allow me:

  • To quickly create a matrix object representing the dataset from the corresponding data file
  • To be able to quickly extract individual elements of the matrix as well as blocks of its rows and columns. For example, I want to answer questions like
    • "Give me a list of all people with {'has_curly_hair','likes_coffee','man_u_fan'}"
    • "Give me a union of attributes of {'mike','joe'}"

My current implementation uses a pair of arrays for the two sets and a scipy sparse matrix. So if

people = ['john','jane','mike','joe']
attrs = ['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan']

then I would create a sparse matrix data of size 4 X 5 and the sample data shown above would correspond to elements

data[0,0]
data[0,3]
data[0,1]
data[1,1]
data[1,4]
...

I also maintain two inverse indices so that I don't have to invoke people.index('mike') or attrs.index('has_curly_hair') too often

This works OK but I have to maintain the indices explicitly. This is cumbersome, for instance, when I have two datasets with different sets of people and/or attributes and I need to match rows/columns corresponding to the same person/attribute from the two sparse matrices.

So is there an aternative that would allow me to avoid using integer indices and instead use actual elements of the two sets to extract rows/columns, i.e. something like

data['john',:]  # give me all attributes of 'john'
data[:,['has_curly_hair','drives_car']] # give me all people who 'has_curly_hair' or 'drives_car'

?


Solution

  • Assuming that no library does exactly what you want, you can create your own class SparseMatrix and overload the operator []. Heres is one way to do it (the constructor might be different to what you want to have):

    class SparseMatrix():
        def __init__(self, x_label, y_label):
            self.data = {}
            for x,y in zip(x_label,y_label):
                print x,y
                self.data[x] = {}
                for attr in y:
                    self.data[x][attr] = 1
            return
    
        def __getitem__(self, index):
            x,y = index
            if type(x) is str:
                if type(y) is str:
                    return 1 if y in self.data[x] else 0
                if type(y) is slice:
                    return self.data[x].keys()
            if type(x) is slice:
                if type(y) is str:
                    res = []
                    for key in self.data.keys():
                        if y in self.data[key]:
                            res.append(key)
                    return res
                if type(y) is list:
                    res = []
                    for attr in y:
                        res += self.__getitem__((x,attr))
                    return res
    

    And in the REPL, I get:

    > data = SparseMatrix(['john','jane','mike','joe'],[['likes_coffee','has_curly_hair'],['has_dark_hair'],['drives_car'],['man_u_fan']])
    
    > data['john',:]
    ['has_curly_hair', 'likes_coffee']
    
    > data[:,['has_curly_hair','drives_car']]
    ['john', 'mike']