Tags: python, scikit-learn, xgboost, svmlight

Why does DMatrix from xgboost load svmlight text files so fast?


I am using svmlight files as storage for sparse matrices.

A test shows that for a 31,700,108 x 54,070 matrix with 570,601,944 entries,

import xgboost as xgb
dtrain = xgb.DMatrix(train_file)  # train_file is the path to the svmlight text file

takes 21 seconds, which is far faster than

from sklearn.datasets import load_svmlight_file
x_train, y_train = load_svmlight_file(train_file)

which takes 7 minutes.

Before I start digging into the code, can anybody help me explain this difference?

Do you have any suggestions for speeding up the load_svmlight_file function?

Thank you very much!


Solution

  • XGBoost is written in C++ and uses ctypes to wrap that code in a Python package. The implementation of load_svmlight_file is written in Cython, which takes Python-like code and translates it to C. Ideally, Cython would produce C code as good as hand-written C, but in practice it sometimes produces code that is slower than what a C programmer would write.

    The scikit-learn developers themselves acknowledge that load_svmlight_file is not as efficient as it could be, and they point to another loader written in C++:

    "This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at: https://github.com/mblondel/svmlight-loader"
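
    Since that loader is advertised as API-compatible, a minimal sketch of swapping it in could look like the following. The module name svmlight_loader is an assumption based on the repository name, so check the project's README after installing it; the sketch falls back to the scikit-learn loader when the C++ one is not available:

        # Prefer the C++ loader if it is installed; otherwise fall back to sklearn's Cython loader.
        try:
            from svmlight_loader import load_svmlight_file  # assumed module name, from the repo above
        except ImportError:
            from sklearn.datasets import load_svmlight_file

        train_file = "train.svmlight"  # hypothetical path; use the same file as in the question

        # Same call as in the question: returns a scipy sparse matrix and a label array.
        x_train, y_train = load_svmlight_file(train_file)

    Because the loaders share the same call signature, the rest of the pipeline does not need to change, only the import.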