I am using svmlight files to store a sparse matrix.
A test shows that for a 31700108 x 54070 matrix with 570601944 entries,
import xgboost as xgb
dtrain = xgb.DMatrix(train_file)
took 21 seconds, way faster than
from sklearn.datasets import load_svmlight_file
x_train, y_train = load_svmlight_file(train_file)
which took 7 minutes.
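For anyone who wants to reproduce a comparison like this, here is a minimal sketch of a wall-clock timer. The helper name `timed` is hypothetical and not part of xgboost or scikit-learn. The calls to the two loaders are left commented out because they need both packages and a real `train_file`.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical usage with the two loaders compared in this question:
# dtrain, t_xgb = timed(xgb.DMatrix, train_file)
# (x_train, y_train), t_skl = timed(load_svmlight_file, train_file)
# print(f"xgboost: {t_xgb:.1f}s, scikit-learn: {t_skl:.1f}s")
```

Note that a single run also includes disk caching effects, so the second loader to read the same file may get an unfair advantage.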
Before I start hacking the code, can anybody help me answer this?
Do you have any suggestions for speeding up the load_svmlight_file function?
Thank you very much!
XGBoost is written in C++ and uses ctypes to wrap that in a Python package. The implementation of load_svmlight_file
is written in Cython, which takes Python-like code and translates it to C. Ideally, Cython would produce perfect C code; in practice, it sometimes produces code that is slower than what a C programmer would write by hand.
The scikit-learn developers themselves acknowledge that load_svmlight_file
is not as efficient as it could be, and their documentation points to another loader written in C++:
This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at: https://github.com/mblondel/svmlight-loader
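Since that loader is described as API-compatible, switching should mostly be a matter of changing the import. Below is a minimal sketch that prefers the faster loader when it is installed and falls back to scikit-learn otherwise. It assumes the svmlight-loader project installs a module named `svmlight_loader`; the helper `pick_loader` is hypothetical, and I have not benchmarked the result on a file of your size.

```python
import importlib
import importlib.util

# Candidate modules in order of preference. Both are expected to expose a
# function called load_svmlight_file. The module name "svmlight_loader" is
# an assumption about how the svmlight-loader project is installed.
CANDIDATES = ("svmlight_loader", "sklearn.datasets")

def pick_loader():
    """Return (module_name, load_svmlight_file) for the first installed
    candidate, or (None, None) if neither package is available."""
    for name in CANDIDATES:
        top_level = name.split(".")[0]
        if importlib.util.find_spec(top_level) is None:
            continue  # package not installed, try the next candidate
        module = importlib.import_module(name)
        return name, getattr(module, "load_svmlight_file")
    return None, None

module_name, load_svmlight_file = pick_loader()
print("using loader from:", module_name)

# With a loader available, the call is the same as in the question:
# x_train, y_train = load_svmlight_file(train_file)
```

Keeping the choice in one place means the rest of the code does not need to know which implementation is in use.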