Tags: apache-spark, pyspark, jupyter-notebook, py4j

Pyspark Jupyter - dataframe created in java code vs python code


I have a class in Java that builds a fairly sophisticated Spark DataFrame:

package companyX;

import org.apache.spark.sql.DataFrame;

public class DFBuilder {
   public DataFrame build() {
       // ... assemble the DataFrame ...
       return dataframe;
   }
}

I add this class to the pyspark/Jupyter classpath so it's callable through py4j. But when I call it, I get a strange type:

b = sc._jvm.companyX.DFBuilder()
print(type(b.build()))
#prints: py4j.java_gateway.JavaObject

VS

print(type(sc.parallelize([]).toDF()))
#prints: pyspark.sql.dataframe.DataFrame

Is there a way to convert this JavaObject into a proper pyspark DataFrame? One problem I have: when I call df.show() on a DataFrame built in Java, the output ends up in the Spark logs instead of the notebook cell.


Solution

  • You can use the DataFrame initializer:

    from pyspark.sql import DataFrame, SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Wrap the py4j JavaObject in a pyspark DataFrame
    df = DataFrame(b.build(), spark)
    

    If you use an outdated Spark version (1.x), replace the SparkSession instance with an SQLContext.
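
    A minimal, self-contained sketch of the whole round trip. Since the Java `DFBuilder` class isn't available here, it grabs the raw py4j JavaObject behind a Python-built DataFrame via the internal `_jdf` attribute as a stand-in for `b.build()`, then wraps it back with an SQLContext (the old-Spark variant of the wrap):

    ```python
    from pyspark.sql import DataFrame, SQLContext, SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Stand-in for b.build(): the underlying py4j JavaObject of a DataFrame
    java_df = spark.range(3)._jdf
    print(type(java_df))   # py4j.java_gateway.JavaObject

    # On old Spark versions, pass an SQLContext instead of the SparkSession
    # (on Spark 3.x this still works but emits a deprecation warning)
    sqlContext = SQLContext(spark.sparkContext)
    df = DataFrame(java_df, sqlContext)
    print(type(df))        # pyspark.sql.dataframe.DataFrame
    df.show()              # now renders in the notebook cell, not the Spark logs
    ```

    The same wrap also fixes the df.show() problem from the question: once the JavaObject is wrapped in a pyspark DataFrame, show() is driven from the Python side and prints into the cell output.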

    Reference: Zeppelin: Scala Dataframe to python