Search code examples
javaapache-sparkapache-spark-sqlrddjava-pair-rdd

Convert JavaPairRDD to Dataframe in Spark Java API


I am using Spark 1.6 with Java 7

I have a pair RDD:

JavaPairRDD<String, String> filesRDD = sc.wholeTextFiles(args[0]);

I want to convert it into DataFrame with schema.

It seems that first I have to convert pairRDD to RowRDD.

So how to create RowRdd from PairRDD ?


Solution

  • For Java 7 you need to define a map function

    public static final Function<Tuple2<String, String>,Row> mappingFunc = (tuple) -> {
        return RowFactory.create(tuple._1(),tuple._2());
    };
    

    Now you can call this function to get JavaRDD<Row>

    JavaRDD<Row> rowRDD = filesRDD.map(mappingFunc);
    

    With Java 8 it is simply like

    JavaRDD<Row> rowRDD = filesRDD.map(tuple -> RowFactory.create(tuple._1(),tuple._2()));
    

    Another way to get Dataframe from JavaPairRDD is

    DataFrame df = sqlContext.createDataset(JavaPairRDD.toRDD(filesRDD), Encoders.tuple(Encoders.STRING(),Encoders.STRING())).toDF();