Tags: apache-spark, apache-spark-mllib, apache-spark-ml

Using MatrixUDT as column in SparkSQL Dataframe


I'm trying to load a set of medical images into a Spark SQL DataFrame, with each image stored in a matrix column. I see that Spark recently added MatrixUDT to support this kind of case, but I can't find an example of using it in a DataFrame.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/MatrixUDT.scala

Can anyone help me with this?

Really appreciate your help.

Thanks

Karthik Vadla


Solution

  • Actually, MatrixUDT has been part of o.a.s.mllib.linalg since 1.4 and has only recently been copied to o.a.s.ml.linalg. Since it has never been public, you cannot even declare a correct schema, so I seriously doubt it is intended for general applications. Not to mention that the API is arguably too limited to be useful in practice.

    Nevertheless, basic conversions work just fine, so all you need is an RDD or Seq of product types (once again, it is not possible to define a schema) and you're good to go:

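    import org.apache.spark.ml.linalg.Matrices
    // Outside spark-shell, toDF/toDS also need: import spark.implicits._
    // (where spark is your SparkSession)

    // Matrices.dense(numRows, numCols, values) takes values in column-major order
    Seq((1, Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))).toDF
    // org.apache.spark.sql.DataFrame = [_1: int, _2: matrix]

    Seq((1, Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))).toDS
    // org.apache.spark.sql.Dataset[(Int, org.apache.spark.ml.linalg.Matrix)]
    //   = [_1: int, _2: matrix]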
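
To connect this back to the image use case in the question, here is a minimal sketch of the same conversion applied to pixel data. It assumes each image has already been decoded into a row-major `Array[Array[Double]]`; the helper `toMatrix`, the column names `id` and `image`, and the sample values are hypothetical and only serve as illustration. It relies on the tuple-based conversion shown above rather than on an explicit schema.

```scala
import org.apache.spark.ml.linalg.{Matrices, Matrix}
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("matrix-column-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: turn a row-major 2D pixel array into an ml.linalg Matrix.
// Matrices.dense expects column-major values, so transpose before flattening.
def toMatrix(pixels: Array[Array[Double]]): Matrix = {
  val numRows = pixels.length
  val numCols = pixels.head.length
  Matrices.dense(numRows, numCols, pixels.transpose.flatten)
}

// Made-up 2 x 3 "image" used only for illustration.
val pixels = Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0)
)

// The schema is derived from the tuple type, as in the answer above.
val df = Seq(("img1", toMatrix(pixels))).toDF("id", "image")
df.printSchema()

// Read the matrix back out of the first row.
val restored = df.first().getAs[Matrix]("image")
```

Element access on the restored value uses `restored(i, j)` with zero-based row and column indices, so the transpose in `toMatrix` is what keeps `restored(i, j)` equal to `pixels(i)(j)`.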