I have been working with some R packages that calculate (cosine) (sparse) similarity matrices from sparse binary matrices, e.g. proxyC.
As I am now starting (and learning) to use Python as well, and I was told it might even be faster, I would like to try to run the same calculations there.
I found this interesting post:
What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
which describes a few methods.
I did try some of them out after writing out a small test matrix myself by hand.
Now I would like to try on 'real' data.
And that's where I encounter a problem I currently cannot solve.
My data come in TSV files that associate objects (IDs) with comma-separated lists of features (FPs), e.g.:
ID FP
1 A,B,C
2 A,D
3 C,D,F
4 A,F
5 E,H,M
I need to convert this to a sparse binary matrix.
Even in R it took me some time to figure out the best way to do it.
I first strsplit the FP lists by comma, turning the FP column from a character vector into a list of character vectors. Then I unlist FP, repeating each ID as many times as the lengths of the FP vectors, which gives me this:
ID FP
1 A
1 B
1 C
2 A
2 D
3 C
3 D
3 F
4 A
4 F
5 E
5 H
5 M
And I make the sparse binary feature matrix with xtabs:
5 x 8 sparse Matrix of class "dgCMatrix"
FP
ID A B C D E F H M
1 1 1 1 . . . . .
2 1 . . 1 . . . .
3 . . 1 1 . 1 . .
4 1 . . . . 1 . .
5 . . . . 1 . 1 1
I am sure it is possible to do this in Python (in this case going from the TSV file to a CSR matrix, as in the post I linked), but I am still a beginner, and I suspect it would take me a very long time to figure out all the details and get it right.
Would anyone be able to help / point me to posts describing the necessary steps with examples?
Thanks!
import pandas as pd
df = pd.DataFrame({'ID':[1,2,3], 'FP':["A,B,C","A,D","C,D,F"]})
>>> df
ID FP
0 1 A,B,C
1 2 A,D
2 3 C,D,F
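The frame above is typed in by hand. For a real TSV file, pandas can read the two columns directly. The following is a minimal sketch, assuming a tab-separated file with a header row containing ID and FP. The file contents are simulated with an in-memory string (tsv_text, a name made up for this example) so that the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical file contents; in practice, pass the path of your .tsv file
# to pd.read_csv instead of the StringIO object.
tsv_text = "ID\tFP\n1\tA,B,C\n2\tA,D\n3\tC,D,F\n4\tA,F\n5\tE,H,M\n"

# sep="\t" tells pandas that columns are tab-separated, so the commas
# inside the FP column stay part of the string.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tsv)
```

The result has one row per object and can be used in place of the hand-built df in the steps that follow.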
Split the column and explode it into a long table:
df['FP'] = df['FP'].str.split(",")
df = df.explode(column="FP")
>>> df
ID FP
0 1 A
0 1 B
0 1 C
1 2 A
1 2 D
2 3 C
2 3 D
2 3 F
Encode the FP column as a categorical, so that each feature gets an integer code:
df['FP'] = df['FP'].astype('category')
Write it into a sparse matrix:
from scipy.sparse import csr_matrix
import numpy as np
# The ID values are used directly as row indices; they start at 1, so row 0 stays empty
mat = csr_matrix((np.ones(df.shape[0]), (df['ID'], df['FP'].cat.codes)))
>>> mat.toarray()
array([[0., 0., 0., 0., 0.],
[1., 1., 1., 0., 0.],
[1., 0., 0., 1., 0.],
[0., 0., 1., 1., 1.]])
Make sure to keep track of which columns correspond to which categorical levels. You can also encode the ID column if you'd prefer (if the IDs are not 0-indexed integers, it is probably a good idea).
df['ID'] = df['ID'].astype('category')
mat = csr_matrix((np.ones(df.shape[0]), (df['ID'].cat.codes, df['FP'].cat.codes)))
>>> mat.toarray()
array([[1., 1., 1., 0., 0.],
[1., 0., 0., 1., 0.],
[0., 0., 1., 1., 1.]])
Again, keep track of your categorical levels.
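One way to keep track of the levels is to store the category arrays next to the matrix. The sketch below repeats the steps on the same three-row example and keeps two label arrays, row_labels and col_labels (names chosen here for illustration only), so that row i and column j of the matrix can be mapped back to an ID and a feature.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'ID': [1, 2, 3], 'FP': ["A,B,C", "A,D", "C,D,F"]})

# Long format: one row per (ID, feature) pair.
df['FP'] = df['FP'].str.split(",")
df = df.explode(column="FP")

# Categorical codes give 0-based integer positions for rows and columns.
df['ID'] = df['ID'].astype('category')
df['FP'] = df['FP'].astype('category')

# Position k in each array is the label of row/column k of the matrix.
row_labels = df['ID'].cat.categories.to_numpy()
col_labels = df['FP'].cat.categories.to_numpy()

mat = csr_matrix(
    (np.ones(df.shape[0]), (df['ID'].cat.codes, df['FP'].cat.codes)),
    shape=(len(row_labels), len(col_labels)),
)

# Look up the features of the object with ID 3 by label rather than by position.
i = int(np.where(row_labels == 3)[0][0])
features_of_3 = col_labels[mat[i].indices]
print(sorted(features_of_3))
```

Passing shape explicitly is optional here, but it makes the intended dimensions visible and guards against a mismatch between the matrix and the label arrays.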