scala · apache-spark · dl4j

Converting a DataFrame from Spark to the type used by DL4J


Is there any convenient way to convert a Spark DataFrame to the type used by DL4J? Currently, when I use a DataFrame in algorithms with DL4J, I get the error: "type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".


Solution

  • In general, we use DataVec for that. I can point you at examples if you want. DataFrames make too many assumptions, which renders them too brittle for real-world deep learning.

    Beyond that, a data frame is typically not a good abstraction for representing linear algebra (it falls down when dealing with images, for example).

    We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java

    But in general, a DataSet is just a pair of ndarrays, just like in numpy. If you have to use Spark tools and only want ndarrays for the last mile, my advice would be to get the DataFrame to match some form of schema that is purely numerical, then map each row to an ndarray "row".

    In general, a big reason we do this is because all of our ndarrays are off heap. Spark has many limitations when it comes to working with its data pipelines, and it uses the JVM for things it shouldn't be used for (matrix math) - we took a different approach that allows us to use GPUs and a bunch of other things efficiently.

    When we do that conversion, it ends up being: raw data -> numerical representation -> ndarray

    What you could do is map the DataFrame onto a double/float array and then use Nd4j.create(float/doubleArray), or alternatively: someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(inputFloatArray), yourLabelINDArray))

    That will give you a "dataset". You need a pair of ndarrays: one matching your input data and one for the label. What the label looks like depends on the kind of problem you're solving, whether that's classification or regression.
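The "purely numerical schema" step above can be sketched as follows. `isPurelyNumeric` is a hypothetical helper (not a Spark or DL4J API) that takes column type names as plain strings, and the commented cast assumes Spark is on the classpath:

```scala
// Hypothetical helper: check that every column type in a schema is numeric,
// given the type names as simple lowercase-able strings.
val numericTypes = Set("double", "float", "int", "long", "short", "byte")

def isPurelyNumeric(columnTypes: Seq[String]): Boolean =
  columnTypes.forall(t => numericTypes.contains(t.toLowerCase))

// With Spark on the classpath, forcing a purely numerical schema could look like:
//   import org.apache.spark.sql.functions.col
//   val numericDf = df.select(df.columns.map(c => col(c).cast("double")): _*)
```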
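Putting the pieces together, here is a minimal sketch of the row-to-DataSet mapping, assuming the feature columns come first and the label column(s) last. `numFeatures` and `toFeatureAndLabel` are illustrative names, not DL4J API; the Spark/ND4J calls in the comment assume those libraries are on the classpath:

```scala
// Pure step: split one flattened numeric row into (features, label) arrays.
// numFeatures = how many leading values are features; the rest are labels.
def toFeatureAndLabel(row: Array[Double], numFeatures: Int): (Array[Double], Array[Double]) =
  (row.take(numFeatures), row.drop(numFeatures))

// With Spark and ND4J on the classpath, the last mile would then be roughly:
//   val dataSetRdd: RDD[DataSet] = df.rdd.map { r =>
//     val values = Array.tabulate(r.length)(i => r.getDouble(i))
//     val (f, l) = toFeatureAndLabel(values, numFeatures)
//     new DataSet(Nd4j.create(f), Nd4j.create(l)) // pair of ndarrays: input + label
//   }
```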