I'm trying to load set of medical images into spark SQL dataframe. Here each image is loaded into matrix column of dataframe. I see spark recently added MatrixUDT to support this kind of cases, but i don't find a sample for using in dataframe.
Can anyone help me with this.
Really appreciate your help.
Thanks
Karthik Vadla
Actually MatrixUDT
has been a part of the o.a.s.mllib.linalg
since 1.4 and only recently has been copied to o.a.s.ml.linalg
. Since it's never been public you cannot even declare a correct schema so I seriously doubt it is intended for general applications. Not to mention that API is arguably to limited to be useful in practice.
Nevertheless basic conversions work just fine so all you need is a RDD or Seq
of product types (once again it is not possible to define schema) and you're good to go:
import org.apache.spark.ml.linalg.Matrices
Seq((1, Matrices.dense(2, 2, Array(1, 2, 3, 4)))).toDF
// org.apache.spark.sql.DataFrame = [_1: int, _2: matrix]
Seq((1, Matrices.dense(2, 2, Array(1, 2, 3, 4)))).toDS
// org.apache.spark.sql.Dataset[(Int, org.apache.spark.ml.linalg.Matrix)]
// = [_1: int, _2: matrix]