I am using Spark 1.6 with Java 7
I have a pair RDD:
JavaPairRDD<String, String> filesRDD = sc.wholeTextFiles(args[0]);
I want to convert it into DataFrame
with schema.
It seems that first I have to convert pairRDD to RowRDD.
So how to create RowRdd from PairRDD ?
For Java 7 you need to define a map function
public static final Function<Tuple2<String, String>,Row> mappingFunc = (tuple) -> {
return RowFactory.create(tuple._1(),tuple._2());
};
Now you can call this function to get JavaRDD<Row>
JavaRDD<Row> rowRDD = filesRDD.map(mappingFunc);
With Java 8 it is simply like
JavaRDD<Row> rowRDD = filesRDD.map(tuple -> RowFactory.create(tuple._1(),tuple._2()));
Another way to get Dataframe from JavaPairRDD is
DataFrame df = sqlContext.createDataset(JavaPairRDD.toRDD(filesRDD), Encoders.tuple(Encoders.STRING(),Encoders.STRING())).toDF();