I'm creating a RDD in 1st part of the application, then converting it to a list using rdd.collect().
But for some reason the list size is coming as 0 in the second part of the application , while the RDD from which I'm creating the list is not empty.Even rdd.toArray() is giving empty list.
Below is my program.
public class Query5kPids implements Serializable{
List<String> ListFromS3 = new ArrayList<String>();
public static void main(String[] args) throws JSONException, IOException, InterruptedException, URISyntaxException {
SparkConf conf = new SparkConf();
conf.setAppName("Spark-Cassandra Integration");
conf.set("spark.cassandra.connection.host", "12.16.193.19");
conf.setMaster("yarn-cluster");
SparkConf conf1 = new SparkConf().setAppName("SparkAutomation").setMaster("yarn-cluster");
Query5kPids app1 = new Query5kPids(conf1);
app1.run1(file);
Query5kPids app = new Query5kPids(conf);
System.out.println("Both RDD has been generated");
app.run();
}
private void run() throws JSONException, IOException, InterruptedException {
JavaSparkContext sc = new JavaSparkContext(conf);
query(sc);
sc.stop();
}
private void run1(File file) throws JSONException, IOException, InterruptedException {
JavaSparkContext sc = new JavaSparkContext(conf);
getData(sc,file);
sc.stop();
}
private void getData(JavaSparkContext sc, File file) {
JavaRDD<String> Data = sc.textFile(file.toString());
System.out.println("RDD Count is " + Data.count());
// here it prints some count value
ListFromS3 = Data.collect();
// ListFromS3 = Data.toArray();
}
private void query(JavaSparkContext sc) {
System.out.println("RDD Count is " + ListFromS3.size());
// Prints 0
// So cant convert the list to RDD
JavaRDD<String> rddFromGz = sc.parallelize(ListFromS3);
}
}
NOTE -> In the actual program , the RDD and List is of type.
List<UserSetGet> ListFromS3 = new ArrayList<UserSetGet>();
JavaRDD<UserSetGet> Data = new ....
where UserSetGet is a Pojo , With Setter and getter methods, and its Serializable.
app1.run1
puts the RDD contents into app1.ListFromS3
. Then you look at app.ListFromS3
, which is empty. app1.ListFromS3
and app.ListFromS3
are fields on two different objects. Setting one does not set the other.
I think you meant ListFromS3
to be static
, meaning it belongs to the Query5kPids
class, not to a particular instance. Like this:
static List<String> ListFromS3 = new ArrayList<String>();