What is the most efficient and scalable way to convert a scala dataframe to a sparse dmatrix for XGBoost4J?
Say I have a dataframe train
with columns row_index
, column_index
, and value
, it would be something like
new DMatrix(train.select("row_index"), train.select("column_index"), train.select("Value"), DMatrix.SparseType.CSR, n_col)
However the above code results in a type mismatch because DMatrix expects Array[Long]
.
train.select(F.collect_list("row_index")).first().getList[Long](0)
seems like a possible option but it doesn't seem to be memory friendly and scalable.
I am doing this on Databricks so solutions in the other supported languages (python, SQL, scala) are welcome.
The answer was to use sparse vectors by row rather than trying to create sparse matrix or dmatrix.
train.rdd.map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble))).groupByKey().map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq))).toDF
I tested scoring a sample of the data in R
using Matrix::sparseMatrix
and xgboost::dmatrix
and the results matched up.