scala apache-spark-sql sparse-matrix databricks xgboost

XGBoost4J - Scala DataFrame to sparse DMatrix

What is the most efficient and scalable way to convert a Spark DataFrame in Scala to a sparse DMatrix for XGBoost4J?

Say I have a DataFrame train with columns row_index, column_index, and value. I would expect the call to look something like this:

new DMatrix(train.select("row_index"), train.select("column_index"), train.select("value"), DMatrix.SparseType.CSR, n_col)

However, the above code results in a type mismatch, because DMatrix expects local arrays such as Array[Long] rather than DataFrames.

train.select(F.collect_list("row_index")).first().getList[Long](0) seems like a possible option, but it pulls the entire column onto the driver, so it is neither memory-friendly nor scalable.
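For data that is small enough to collect, the CSR layout itself can be sketched in plain Scala. This is only an illustration: the names CsrSketch, Csr, and toCsr are hypothetical, and it assumes the DMatrix CSR constructor takes row headers as Array[Long] (as the type error suggests), column indices as Array[Int], and values as Array[Float].

```scala
object CsrSketch {
  // Hypothetical container for the three CSR arrays.
  final case class Csr(rowHeaders: Array[Long], colIndex: Array[Int], data: Array[Float])

  // Build CSR arrays from local (row, column, value) triples.
  // Assumes 0-based row indices in the range [0, nRows).
  def toCsr(triples: Seq[(Int, Int, Float)], nRows: Int): Csr = {
    // CSR stores entries row by row, so order by row (then column).
    val sorted = triples.sortBy { case (r, c, _) => (r, c) }

    // Count the non-zeros in each row, then take a running sum:
    // rowHeaders(i) is the offset of row i's first entry, and
    // rowHeaders(nRows) is the total number of non-zeros.
    val counts = new Array[Long](nRows)
    sorted.foreach { case (r, _, _) => counts(r) += 1L }
    val rowHeaders = counts.scanLeft(0L)(_ + _)

    Csr(rowHeaders, sorted.map(_._2).toArray, sorted.map(_._3).toArray)
  }
}
```

Under the assumption above, the three arrays would then be passed to the DMatrix constructor together with DMatrix.SparseType.CSR and n_col. Everything still has to fit in driver memory, so this does not address the scalability concern.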

I am doing this on Databricks, so solutions in any of the supported languages (Python, SQL, Scala) are welcome.

Solution

The answer was to build one sparse vector per row rather than trying to create a single sparse matrix or DMatrix.

train.rdd.map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble))).groupByKey().map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq))).toDF

I tested this by scoring a sample of the data in R using Matrix::sparseMatrix and xgboost::xgb.DMatrix, and the results matched.
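For reference, the one-liner in the answer can be expanded into a more self-contained form with its imports. This is a sketch under assumptions, not a tested Databricks recipe: the object and method names (SparseRows, toSparseRows) and the output column name "features" are hypothetical, and the casts assume the three columns are numeric.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparseRows {
  // Hypothetical helper: turns (row_index, column_index, value) triples
  // into one sparse feature vector per row_index.
  def toSparseRows(spark: SparkSession, train: DataFrame, nCol: Int): DataFrame = {
    import spark.implicits._ // needed for $"...", .as[...] and .toDF

    train
      // Cast up front so the tuple encoder below matches the column types.
      .select(
        $"row_index".cast("int"),
        $"column_index".cast("int"),
        $"value".cast("double"))
      .as[(Int, Int, Double)] // tuple fields are bound by column position
      .rdd
      .map { case (row, col, value) => (row, (col, value)) }
      // Gather all (column, value) pairs that belong to the same row.
      .groupByKey()
      // Vectors.sparse sorts the pairs by index and rejects duplicate indices.
      .map { case (row, entries) => (row, Vectors.sparse(nCol, entries.toSeq)) }
      .toDF("row_index", "features")
  }
}
```

Because groupByKey brings all entries of a row together on one executor, this works well when individual rows are sparse. It also assumes each (row_index, column_index) pair appears at most once.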