I have a large data set (10 000 rows), where each row (sample) is represented by a list of bits (~200 000 bits). Each bit represents the absence or presence of a feature in the sample. So it's a large (10 000 x 200 000), high-dimensional, sparse data set.
In order to save some memory, for each sample I'm only saving the indices of the non-zero bits. Example for a vector with 7 features:
[0, 0, 1, 0, 0, 1, 1] ===> [2, 5, 6]
I'm doing this for the whole data set. Let the result be X (10 000 variable-size vectors). Example for an initial 3x4 data set:

                   [[0,0,1,0],         [[2],
    initial_data =  [0,1,1,0],  ===>    [1,2],   = X
                    [0,1,0,1]]          [1,3]]
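For illustration, here is a minimal sketch of one way to do this conversion with NumPy (np.flatnonzero returns the indices of the non-zero entries; the variable names are just for the example):

    import numpy as np

    initial_data = np.array([[0, 0, 1, 0],
                             [0, 1, 1, 0],
                             [0, 1, 0, 1]])

    # one index-list per row: [[2], [1, 2], [1, 3]]
    X = [np.flatnonzero(row).tolist() for row in initial_data]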
Each row is labeled with one of two labels: malignant or benign.
A linear support vector classification model (the one in sklearn.svm.LinearSVC) is trained on the data represented by X. Knowing that the aforementioned model accepts sparse input, and that there are seven possible sparse representations in SciPy:

Which representation is the most efficient for training the model? And how can I efficiently convert X to that representation?
csr is the way to go, which is confirmed by sklearn's own sources. Excerpt:
    class LinearSVC(BaseEstimator, LinearClassifierMixin,
                    _LearntSelectorMixin, SparseCoefMixin):
        ...
        ...
        X, y = check_X_y(X, y, accept_sparse='csr',
                         dtype=np.float64, order="C")
csr and many other formats are not recommended for building a sparse matrix directly (adding entries / changing the sparsity-structure is very costly).
Use dok_matrix / lil_matrix to build the sparse matrix from your data (which should be simple), and then convert (which is done in linear time):

    X = X.tocsr()
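Putting it together, a minimal sketch of the whole construction (assuming X is the list of index-lists from the question; n_samples and n_features are placeholders for your real dimensions):

    import numpy as np
    from scipy.sparse import lil_matrix

    n_samples = len(X)     # 10 000 in your case
    n_features = 200000    # number of bits per sample

    # lil_matrix tolerates incremental changes to the sparsity-structure
    M = lil_matrix((n_samples, n_features), dtype=np.float64)
    for i, nonzero_indices in enumerate(X):
        M[i, nonzero_indices] = 1   # set the present features of row i

    M = M.tocsr()                   # linear-time conversion to csr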
Also keep in mind that all the data you pass is converted internally, as liblinear, the external library used by sklearn, has its own data structures. So even if you pass the wrong format, it's only a one-time conversion cost. The pure training procedure does not care!
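For completeness, a minimal usage sketch (assuming M is the csr matrix built above and y holds your malignant/benign labels):

    from sklearn.svm import LinearSVC

    clf = LinearSVC()
    clf.fit(M, y)   # csr input matches accept_sparse='csr', so no format conversion here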