scala apache-spark-sql sparse-matrix databricks xgboost

XGBoost4J - Scala dataframe to sparse dmatrix


What is the most efficient and scalable way to convert a Spark DataFrame (in Scala) to a sparse DMatrix for XGBoost4J?

Say I have a DataFrame train with columns row_index, column_index, and value. I would like to write something like:

new DMatrix(train.select("row_index"), train.select("column_index"), train.select("value"), DMatrix.SparseType.CSR, n_col)

However, the code above fails with a type mismatch: DMatrix expects local arrays (an Array[Long] for the first argument), not DataFrames.

train.select(F.collect_list("row_index")).first().getList[Long](0) seems like a possible option, but it does not seem memory-friendly or scalable, since it pulls the whole column onto the driver.
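
For context on what the CSR constructor would need even if the data did fit on the driver: as far as I understand it, the first array holds row pointers (one offset per row, plus a final end offset), not one row index per entry. The sketch below shows that conversion on plain Scala collections. CooToCsr, Csr, and toCsr are hypothetical names for illustration, not part of XGBoost4J, and the input is assumed to be already collected (row, column, value) triplets.

```scala
// Sketch only: converts (row, col, value) triplets that already fit in
// driver memory into the three CSR arrays. All names here are illustrative.
object CooToCsr {
  final case class Csr(headers: Array[Long], indices: Array[Int], data: Array[Float])

  def toCsr(triplets: Seq[(Int, Int, Float)], nRows: Int): Csr = {
    // Sort by row, then column, so each row's entries are contiguous.
    val sorted = triplets.sortBy { case (r, c, _) => (r, c) }

    // Count entries per row, then prefix-sum into nRows + 1 row pointers.
    val counts = new Array[Long](nRows)
    sorted.foreach { case (r, _, _) => counts(r) += 1L }
    val headers = counts.scanLeft(0L)(_ + _)

    Csr(headers, sorted.map(_._2).toArray, sorted.map(_._3).toArray)
  }
}
```

The resulting headers, indices, and data arrays have the shapes the constructor call in the question appears to expect (Array[Long], Array[Int], Array[Float]); check this against the XGBoost4J version in use. It still requires collecting everything to one machine, so it does not address the scalability concern.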

I am doing this on Databricks, so solutions in any of the supported languages (Python, SQL, Scala) are welcome.


Solution

  • The answer was to build one sparse vector per row rather than trying to create a sparse matrix or DMatrix directly.

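    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._  // for .toDF; spark is the active SparkSession

    train.rdd
      .map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble)))  // (row, (col, value))
      .groupByKey()
      .map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq)))
      .toDF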

    I tested this by scoring a sample of the data in R using Matrix::sparseMatrix and xgboost::xgb.DMatrix, and the results matched.
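
To make the row-wise grouping concrete, here is a minimal local analogue on plain Scala collections, with no Spark involved. RowGrouping and byRow are hypothetical names used only for this illustration. Each row ends up as a pair of (column indices, values) with the indices sorted ascending, which is the information a sparse vector carries.

```scala
// Sketch only: a local, Spark-free analogue of grouping triplets by row.
object RowGrouping {
  // Returns, per row, the sorted column indices and their matching values.
  def byRow(triplets: Seq[(Int, Int, Double)]): Map[Int, (Array[Int], Array[Double])] =
    triplets
      .groupBy(_._1)
      .map { case (row, entries) =>
        val sorted = entries.map(t => (t._2, t._3)).sortBy(_._1)
        (row, (sorted.map(_._1).toArray, sorted.map(_._2).toArray))
      }
}
```

One consequence worth noting: a row with no entries never appears in the grouped result, here or after groupByKey in the Spark version, so all-zero rows would have to be added back separately if they matter for scoring.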