java · apache-spark · rdd

JavaRDD<String> to JavaRDD<Row>


I am reading a txt file as a JavaRDD with the following command:

JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);

Now, I would like to convert this to a JavaRDD<Row>, because the txt file contains two columns of integers and I want to add a schema to the rows after splitting each line into its columns.

I tried also this:

JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))

But it says I cannot assign the result of map to an "Object" RDD.

  1. How can I create a JavaRDD<Row> out of a JavaRDD<String>?
  2. How can I apply map to the JavaRDD?

Thanks!


Solution

  • Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).

    To get an RDD of rows, just create a Row from each array:

    // Requires org.apache.spark.sql.Row and org.apache.spark.sql.RowFactory
    JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
    JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
    JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
    
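Since the two columns hold integers, you will probably also want to parse the split fields before building each Row (otherwise the Row holds strings). The per-line step can be sketched independently of Spark; `ParseLine` and its `parse` method are hypothetical names used only for illustration:

```java
import java.util.Arrays;

public class ParseLine {
    // Mirrors what the map lambdas above do for one line, minus Spark:
    // split on tab, then convert both columns from String to Integer.
    static Object[] parse(String line) {
        String[] fields = line.split("\t");
        return new Object[] { Integer.valueOf(fields[0]), Integer.valueOf(fields[1]) };
    }

    public static void main(String[] args) {
        // A Row built from this array would then carry int-typed values
        System.out.println(Arrays.toString(parse("1\t2"))); // [1, 2]
    }
}
```

Passing the resulting Object[] to RowFactory.create gives a Row whose fields are integers rather than strings.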

    Note that if your goal is then to turn the JavaRDD<Row> into a DataFrame (Dataset<Row>), there is a simpler way: change the delimiter option of spark.read and avoid RDDs altogether:

    Dataset<Row> dataframe = spark.read()
        .option("delimiter", "\t")
        .option("inferSchema", "true") // optional: types the integer columns as int instead of string
        .csv("your_path/file.csv");