I have a dataset that is essentially a sparse binary matrix that represents relationships between elements of two sets. For example, let the 1st set be people (represented by their names), e.g. somehting like this:
people = set(['john','jane','mike','joe'])
and the 2nd set be a bunch of binary attributes, e.g.
attrs = set(['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan'])
The dataset is represented by a tab-separated data file that assigns some of the attributes to each person, e.g.
john likes_coffee
john drives_car
john has_curly_hair
jane has_curly_hair
jane man_u_fan
...
attrs
has about 30,000
elements, people
can be as big 6,000,000
, but the data is sparse, i.e. each person has at most 30-40 attributes
I am looking for a data structure/class in python that would allow me:
matrix
object representing the dataset from the corresponding data file {'has_curly_hair','likes_coffee','man_u_fan'}
"{'mike','joe'}
"My current implementation uses a pair of arrays for the two sets and a scipy
sparse matrix. So if
people = ['john','jane','mike','joe']
attrs = ['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan']
then I would create a sparse matrix data
of size 4 X 5
and the sample data shown above would correspond to elements
data[0,0]
data[0,3]
data[0,1]
data[1,1]
data[1,4]
...
I also maintain two inverse indices so that I don't have to invoke people.index('mike')
or attrs.index('has_curly_hair')
too often
This works OK but I have to maintain the indices explicitly. This is cumbersome, for instance, when I have two datasets with different sets of people and/or attributes and I need to match rows/columns corresponding to the same person/attribute from the two sparse matrices.
So is there an aternative that would allow me to avoid using integer indices and instead use actual elements of the two sets to extract rows/columns, i.e. something like
data['john',:] # give me all attributes of 'john'
data[:,['has_curly_hair','drives_car']] # give me all people who 'has_curly_hair' or 'drives_car'
?
Assuming that no library does exactly what you want, you can create your own class SparseMatrix
and overload the operator []
. Heres is one way to do it (the constructor might be different to what you want to have):
class SparseMatrix():
def __init__(self, x_label, y_label):
self.data = {}
for x,y in zip(x_label,y_label):
print x,y
self.data[x] = {}
for attr in y:
self.data[x][attr] = 1
return
def __getitem__(self, index):
x,y = index
if type(x) is str:
if type(y) is str:
return 1 if y in self.data[x] else 0
if type(y) is slice:
return self.data[x].keys()
if type(x) is slice:
if type(y) is str:
res = []
for key in self.data.keys():
if y in self.data[key]:
res.append(key)
return res
if type(y) is list:
res = []
for attr in y:
res += self.__getitem__((x,attr))
return res
And in the REPL, I get:
> data = SparseMatrix(['john','jane','mike','joe'],[['likes_coffee','has_curly_hair'],['has_dark_hair'],['drives_car'],['man_u_fan']])
> data['john',:]
['has_curly_hair', 'likes_coffee']
> data[:,['has_curly_hair','drives_car']]
['john', 'mike']