Tags: apache-spark, pyspark, jupyter-notebook, py4j

Pyspark Jupyter - dataframe created in java code vs python code


I have a class in Java that builds a fairly sophisticated Spark DataFrame:

package companyX;

import org.apache.spark.sql.DataFrame;

public class DFBuilder {
   public DataFrame build() {
       // ... assemble the DataFrame ...
       return dataframe;
   }
}

I add this class to the pyspark/Jupyter classpath so it's callable through py4j. But when I call it, I get a strange type:

b = sc._jvm.companyX.DFBuilder()
print(type(b.build()))
#prints: py4j.java_gateway.JavaObject

VS

print(type(sc.parallelize([]).toDF()))
#prints: pyspark.sql.dataframe.DataFrame

Is there a way to convert this JavaObject into a proper pyspark DataFrame? One problem I have: when I call df.show() on a DataFrame built in Java, the output ends up in the Spark logs instead of the notebook cell.


Solution

  • You can use the DataFrame initializer:

    from pyspark.sql import DataFrame, SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Wrap the py4j JavaObject in a pyspark DataFrame
    df = DataFrame(b.build(), spark)
    

    If you use an outdated Spark version (1.x), replace the SparkSession instance with an SQLContext.
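
    A minimal, self-contained sketch of the whole round trip. Since the Java `DFBuilder` class isn't available here, it grabs the raw py4j JavaObject behind a Python-built DataFrame via the internal `_jdf` attribute as a stand-in for `b.build()`, then wraps it back with an SQLContext (the old-Spark variant of the wrap):

    ```python
    from pyspark.sql import DataFrame, SQLContext, SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Stand-in for b.build(): the underlying py4j JavaObject of a DataFrame
    java_df = spark.range(3)._jdf
    print(type(java_df))   # py4j.java_gateway.JavaObject

    # On old Spark versions, pass an SQLContext instead of the SparkSession
    # (on Spark 3.x this still works but emits a deprecation warning)
    sqlContext = SQLContext(spark.sparkContext)
    df = DataFrame(java_df, sqlContext)
    print(type(df))        # pyspark.sql.dataframe.DataFrame
    df.show()              # now renders in the notebook cell, not the Spark logs
    ```

    The same wrap also fixes the df.show() problem from the question: once the JavaObject is wrapped in a pyspark DataFrame, show() is driven from the Python side and prints into the cell output.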

    Reference: Zeppelin: Scala Dataframe to python