I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now, I would like to convert this to a JavaRDD<Row>, because the txt file has two columns of Integers and I want to add a schema to the rows after splitting the columns.
I also tried this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"));
But it says I cannot assign the result of the map function to an "Object" RDD.
Thanks!
Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split), that is, a JavaRDD<String[]>, which is why it cannot be assigned to a JavaRDD<Row>.
To get an RDD of rows, just create a Row from each array:
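JavaRDD<String> vertexRDD = ctx.textFile("");
// split yields one String[] per line, holding the tab-separated columns
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
// Cast to Object[] so the array is spread over the varargs: one Row column per field
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create((Object[]) fields));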
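Since the question mentions two columns of Integers, the fields can also be parsed before the Row is built, so that the rows hold integers rather than strings. Here is a minimal sketch in plain Java, assuming every line has exactly two tab-separated integer columns; VertexLineParser and parseLine are hypothetical names used for illustration, not part of Spark:

```java
// Hypothetical helper, not part of Spark: turns one tab-separated line
// into boxed Integers that can be handed to RowFactory.create(Object...).
final class VertexLineParser {

    private VertexLineParser() {}

    // Assumption: each line holds exactly two integer columns separated by a tab.
    static Object[] parseLine(String line) {
        // A limit of -1 keeps trailing empty fields, so malformed lines are detected
        String[] fields = line.split("\t", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 tab-separated columns but found " + fields.length + ": " + line);
        }
        Object[] values = new Object[fields.length];
        for (int i = 0; i < fields.length; i++) {
            // trim() tolerates stray spaces; parseInt throws NumberFormatException on bad input
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }
}
```

With Spark on the classpath, this could be plugged in as vertexRDD.map(line -> RowFactory.create(VertexLineParser.parseLine(line))). A schema with two IntegerType fields could then probably be attached with spark.createDataFrame(rows, schema), though the column names are for you to choose.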
Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:
Dataset<Row> dataframe = spark.read()
    .option("delimiter", "\t")
    .csv("your_path/file.csv");