Tags: scala, apache-spark, deeplearning4j, nd4j

Input Spark Dataframe to DeepLearning4J model


I have data in a Spark DataFrame (df) with 24 feature columns; the 25th column is my target variable. I want to fit my DL4J model on this dataset, but the model takes its input as an org.nd4j.linalg.api.ndarray.INDArray, an org.nd4j.linalg.dataset.DataSet, or an org.nd4j.linalg.dataset.api.iterator.DataSetIterator. How can I convert my DataFrame to one of these types?

I have also tried the Pipeline approach, which feeds the Spark DataFrame to the model directly, but I could not get the sbt dependency on dl4j-spark-ml to work. My build.sbt file is:

scalaVersion := "2.11.8"

libraryDependencies += "org.deeplearning4j" %% "dl4j-spark-ml" % "0.8.0_spark_2-SNAPSHOT"

libraryDependencies += "org.deeplearning4j" % "deeplearning4j-core" % "0.8.0"

libraryDependencies += "org.nd4j" % "nd4j" % "0.8.0"

libraryDependencies += "org.nd4j" % "nd4j-native-platform" % "0.8.0"

libraryDependencies += "org.nd4j" % "nd4j-backends" % "0.8.0"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1" 

Can someone guide me from here? Thanks in advance.


Solution

  • You can use the snapshot builds, which have re-added the spark.ml integration. To use snapshots, add the OSS Sonatype repository, as the examples pom does: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/pom.xml#L16 The version at the time of this writing is 0.8.1-SNAPSHOT.

    Please verify the latest version with the examples repo though: https://github.com/deeplearning4j/dl4j-examples/blob/master/pom.xml#L21

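    You can't mix versions of DL4J: every deeplearning4j and nd4j artifact has to be on the same version, whereas your build combines a 0.8.0_spark_2-SNAPSHOT of dl4j-spark-ml with 0.8.0 releases of everything else. The version you're trying to use is also very out of date (by more than a year), so please upgrade to the latest version.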

    The new spark.ml integration examples can be found here: https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl

    Make sure to add the proper dependency, which is typically something like org.deeplearning4j:dl4j-spark-ml_${YOUR SCALA BINARY VERSION}:0.8.1_spark_${YOUR SPARK VERSION (1 or 2)}-SNAPSHOT
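
Translated to sbt, the advice above might look roughly like the sketch below. It is not a verified build. It assumes that 0.8.1-SNAPSHOT is still the current snapshot (check the examples pom first), that the snapshot is published for Scala 2.11 and Spark 2.x, and that nd4j-native-platform alone is enough as the CPU backend, so the separate nd4j and nd4j-backends entries from the question are left out. The dl4jVersion and sparkVersion names are only illustrative helpers.

```scala
// build.sbt (sketch): keeps every DL4J/ND4J artifact on one version.
scalaVersion := "2.11.8"

// Snapshot builds are not published to Maven Central, so the
// Sonatype OSS snapshot repository has to be added explicitly.
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

// Assumption: 0.8.1-SNAPSHOT is the version quoted in the answer;
// verify it against the dl4j-examples pom before relying on it.
val dl4jVersion = "0.8.1-SNAPSHOT"

val sparkVersion = "2.0.1"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.11) to the artifact name;
  // "spark_2" in the version string selects the Spark 2.x build.
  "org.deeplearning4j" %% "dl4j-spark-ml"        % "0.8.1_spark_2-SNAPSHOT",
  "org.deeplearning4j" %  "deeplearning4j-core"  % dl4jVersion,
  "org.nd4j"           %  "nd4j-native-platform" % dl4jVersion,
  "org.apache.spark"   %% "spark-core"           % sparkVersion,
  "org.apache.spark"   %% "spark-sql"            % sparkVersion
)
```

Defining the version once and reusing it is one way to avoid accidentally mixing DL4J versions when upgrading later.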