I would like to clarify: does vaex.ml.sklearn allow performing out-of-core ML?
I tried the examples from the documentation and noticed that when I use a dataset from an hdf5 file (the evaluated dataset consumes ~3 GB of RAM), RAM usage during the XGBoost training process is around 7-8 GB. Naively, I assumed that out-of-core processing would not consume that much RAM. What am I doing wrong?
My code is

import vaex
import vaex.ml.sklearn
import xgboost

# A regular (in-memory) XGBoost regressor
xgb_model = xgboost.sklearn.XGBRegressor(max_depth=4,
                                         learning_rate=0.1,
                                         n_estimators=100,
                                         subsample=0.75,
                                         random_state=42)

# Wrap it so it plugs into the vaex computational graph
vaex_xgb_model = vaex.ml.sklearn.Predictor(features=features,
                                           target='target',
                                           model=xgb_model,
                                           prediction_name='prediction_xgb')
vaex_xgb_model.fit(df_train)
df_train = vaex_xgb_model.transform(df_train)
where features is a list of ~40 column names.
The external models you are using that are not provided by vaex (or vaex-ml) come "as is". Wrapping them in vaex-ml simply gives you a convenient way of adding them to the vaex computational graph, plus serialization, lazy evaluation, etc. The models themselves are unmodified (I believe this is stated in the docstrings), so they are not out-of-core.
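To illustrate what the wrapper does buy you, here is a minimal sketch continuing from the snippet in the question (df_train, vaex_xgb_model and new_data.hdf5 are hypothetical names, and the behavior described in the comments is my understanding of the vaex state API, not something from this thread):

# transform() does not materialize predictions: it adds
# 'prediction_xgb' as a lazy virtual column, evaluated on access.
df_train = vaex_xgb_model.transform(df_train)
print(df_train.head(5))  # predictions computed only for these rows

# The fitted model travels with the DataFrame state, so it can be
# serialized and re-applied to new data:
df_train.state_write('model_state.json')
df_new = vaex.open('new_data.hdf5')   # hypothetical new dataset
df_new.state_load('model_state.json') # df_new now also has 'prediction_xgb'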
I think vaex-ml has, for example, a K-means model that is implemented in vaex itself, so that one is out-of-core (i.e. it will not use much memory).
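For example, something like this (a rough sketch using the built-in demo dataset; the exact parameter names are from the vaex-ml docs as I remember them, so double-check them):

import vaex
import vaex.ml.cluster

df = vaex.example()  # built-in demo dataset, backed by an hdf5 file

# KMeans from vaex-ml is implemented on top of vaex itself, so fitting
# streams over the data in chunks instead of loading it all into RAM.
kmeans = vaex.ml.cluster.KMeans(features=['x', 'y', 'z'],
                                n_clusters=3,
                                max_iter=100)
kmeans.fit(df)

# The cluster labels are added as a lazy virtual column.
df = kmeans.transform(df)
print(df.head(5))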
The preprocessing transformations available in vaex-ml, like StandardScaler, FrequencyEncoder and so on, are implemented using vaex, so those are out-of-core as well.
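Same idea for the preprocessors (again a sketch; the prefix argument is how I recall the API, so treat it as an assumption):

import vaex
import vaex.ml

df = vaex.example()

# StandardScaler from vaex-ml computes mean/std in a streaming pass and
# adds the scaled features as virtual columns, so no extra copies of
# the data are held in memory.
scaler = vaex.ml.StandardScaler(features=['x', 'y', 'z'],
                                prefix='scaled_')
scaler.fit(df)
df = scaler.transform(df)  # adds scaled_x, scaled_y, scaled_z lazily
print(df[['scaled_x', 'scaled_y', 'scaled_z']].head(3))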