I am using an SVC classifier with a linear kernel to train my model. Training data: 42,000 records.
model = SVC(probability=True)
model.fit(self.features_train, self.labels_train)
y_pred = model.predict(self.features_test)
train_accuracy = model.score(self.features_train, self.labels_train)
test_accuracy = model.score(self.features_test, self.labels_test)
It takes more than 2 hours to train my model. Am I doing something wrong? Also, what can be done to reduce the training time?
Thanks in advance
There are several ways to speed up your SVM training. Let n be the number of records and d the embedding dimensionality. I assume you use scikit-learn.
Reducing training set size. Quoting the docs:
The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
O(n^2) complexity will most likely dominate other factors. Sampling fewer records for training will thus have the largest impact on time. Besides random sampling, you could also try instance selection methods. For example, principal sample analysis has been proposed recently.
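A minimal sketch of the random-sampling option, assuming a stratified subsample is acceptable for your task. The helper name `subsample`, the 25% fraction, and the synthetic data are only illustrative and not taken from your setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split


def subsample(features, labels, fraction=0.25, seed=0):
    """Return a stratified random subset holding `fraction` of the rows."""
    features_small, _, labels_small, _ = train_test_split(
        features,
        labels,
        train_size=fraction,
        stratify=labels,  # keep class proportions roughly unchanged
        random_state=seed,
    )
    return features_small, labels_small


# Synthetic stand-in for the real training data (hypothetical shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

X_small, y_small = subsample(X, y, fraction=0.25)
print(X_small.shape)
```

If fit time grows at least quadratically with n, keeping a quarter of the records should cut it by roughly a factor of 16 or more, at the cost of some accuracy that you would need to measure on your test set.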
Reducing dimensionality. As others have hinted at in their comments, embedding dimension also impacts runtime. Computing inner products for the linear kernel is in O(d). Dimensionality reduction can, therefore, also reduce runtime. In another question, latent semantic indexing was suggested specifically for TF-IDF representations.
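As a rough illustration, and assuming your features are TF-IDF vectors (the question does not say), latent semantic indexing can be sketched with `TruncatedSVD`. The toy corpus, the labels, and `n_components=2` are made up for the example; a real corpus would typically use a much larger number of components:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny hypothetical corpus standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the dog chased the cat",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks and bonds",
]
labels = [0, 0, 0, 1, 1, 1]

# TF-IDF followed by truncated SVD: sparse, high-dimensional vectors
# are projected onto a small number of dense components.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
reduced = lsa.fit_transform(docs)

# The SVM now works with d = 2 instead of the full vocabulary size.
model = SVC(kernel="linear")
model.fit(reduced, labels)
print(reduced.shape)
```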
Disabling probability estimates. Use SVC(probability=False) unless you need the probabilities, because they "will slow down that method" (from the docs).
Different classifier. You may try sklearn.svm.LinearSVC, which is...
[s]imilar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
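A minimal sketch of that swap, using synthetic data in place of your features. The variable names mirror the question, while `make_classification`, the sample sizes, and `dual=False` are my assumptions. Note that `LinearSVC` has no `predict_proba`, and its default loss differs from `SVC`, so results will not be identical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real features and labels.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
features_train, features_test, labels_train, labels_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# dual=False solves the primal problem, which the scikit-learn docs
# prefer when there are more samples than features.
model = LinearSVC(C=1.0, dual=False)
model.fit(features_train, labels_train)

train_accuracy = model.score(features_train, labels_train)
test_accuracy = model.score(features_test, labels_test)
print(train_accuracy, test_accuracy)
```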
Moreover, a scikit-learn dev suggested the kernel_approximation module in a similar question.
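One way this might look, in case you actually want a non-linear kernel: approximate an RBF feature map with `Nystroem` and train a linear model on top. The choice of `Nystroem`, `n_components=100`, and the synthetic data are assumptions for illustration, not something taken from that question:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Nystroem builds an approximate RBF feature map from a subset of the
# samples; the linear SVM is then trained in that feature space.
approx_svm = make_pipeline(
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    LinearSVC(dual=False),
)
approx_svm.fit(X, y)

mapped = approx_svm.named_steps["nystroem"].transform(X)
accuracy = approx_svm.score(X, y)
print(mapped.shape)
```

The number of components trades accuracy of the kernel approximation against speed, so it is worth tuning on held-out data.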