Search code examples
pythonscipyscikit-learnsparse-matrix

Which sparse Matrix representation to use with sklearn.svm.LinearSVC


I have a large data set (10 000 rows), where each row (sample) is represented by a list of bits (~200 000 bits).Each bit represent the absence or presence of the feature in the sample. So, it's a large (10 000 x 200 000) high-dimensional sparse data set

In order to save some memory space, for each sample, I'm only saving the indices of the non zero bits. Example for a vector with 7 features:

[0, 0, 1, 0, 0, 1, 1] ===> [2, 5, 6]

I'm doing this for all the data set. Let the result be X (10 000 variable size vectors). Exemple for an initial data set 3x4:

                 [[0,0,1,0],       [[2],
  initial_data=   [0,1,1,0],  ===>  [1,2],   = X
                  [0,1,0,1]]        [1,3]]

Each row is labeled with either of the two labels: malignantor benign. A linear support vector classification model (the one in sklearn.svm.LinearSVC) is trained on the data represented by X. Knowing that the aforementioned model accepts sparse input and there is seven representation possible in SciPy:

  • csc_matrix: Compressed Sparse Column format
  • csr_matrix: Compressed Sparse Row format
  • bsr_matrix: Block Sparse Row format
  • lil_matrix: List of Lists format
  • dok_matrix: Dictionary of Keys format
  • coo_matrix: COOrdinate format (aka IJV, triplet format)
  • dia_matrix: DIAgonal format

Which representation is more efficient for training the model ? and how can I efficiently pass from X to that representation ?


Solution

  • csr is the way to go, which is supported by sklearn's sources. Excerpt:

    class LinearSVC(BaseEstimator, LinearClassifierMixin,
                _LearntSelectorMixin, SparseCoefMixin):
        ...
        ...
        X, y = check_X_y(X, y, accept_sparse='csr',
                         dtype=np.float64, order="C")
    

    csr and many other formats are not recommended for building a sparse-matrix directly (adding stuff / changing the sparsity-structure is very costly).

    Use dok_matrix / lil_matrix to build a sparse-matrix from your data (should be simple) and then convert (which is done in linear-time).

    X = X.tocsr()
    

    Also keep in mind, that all the data you pass is converted internally as liblinear, the external library used by sklearn has it's own data-structures. So if you pass the wrong format; it's a one-time conversion cost which should occur. The pure training-procedure does not care!