Search code examples
javaapache-sparkcassandraspark-cassandra-connector

Getting empty Java List after converting from RDD


I'm creating a RDD in 1st part of the application, then converting it to a list using rdd.collect().

But for some reason the list size is coming as 0 in the second part of the application , while the RDD from which I'm creating the list is not empty.Even rdd.toArray() is giving empty list.

Below is my program.

 public class Query5kPids implements Serializable{

 List<String> ListFromS3 = new ArrayList<String>();

 public static void main(String[] args) throws JSONException, IOException, InterruptedException, URISyntaxException {


        SparkConf conf = new SparkConf();
        conf.setAppName("Spark-Cassandra Integration");
        conf.set("spark.cassandra.connection.host", "12.16.193.19");
        conf.setMaster("yarn-cluster");

        SparkConf conf1 = new SparkConf().setAppName("SparkAutomation").setMaster("yarn-cluster");

        Query5kPids app1 = new Query5kPids(conf1);
        app1.run1(file);

        Query5kPids app = new Query5kPids(conf);
        System.out.println("Both RDD has been generated");
        app.run();

}

private void run() throws JSONException, IOException, InterruptedException {

        JavaSparkContext sc = new JavaSparkContext(conf);
        query(sc);
        sc.stop();
}

private void run1(File file) throws JSONException, IOException, InterruptedException {
         JavaSparkContext sc = new JavaSparkContext(conf);
         getData(sc,file);
         sc.stop();

}

    private void getData(JavaSparkContext sc, File file) {

         JavaRDD<String> Data = sc.textFile(file.toString());
         System.out.println("RDD Count is " + Data.count());
         // here it prints some count value
         ListFromS3 = Data.collect();
         // ListFromS3 = Data.toArray();

    }
     private void query(JavaSparkContext sc) {

         System.out.println("RDD Count is " + ListFromS3.size());
         // Prints 0
         // So cant convert the list to RDD
         JavaRDD<String> rddFromGz = sc.parallelize(ListFromS3);

    }


  }

NOTE -> In the actual program , the RDD and List is of type.

List<UserSetGet> ListFromS3 = new ArrayList<UserSetGet>();
JavaRDD<UserSetGet> Data = new ....

where UserSetGet is a Pojo , With Setter and getter methods, and its Serializable.


Solution

  • app1.run1 puts the RDD contents into app1.ListFromS3. Then you look at app.ListFromS3, which is empty. app1.ListFromS3 and app.ListFromS3 are fields on two different objects. Setting one does not set the other.

    I think you meant ListFromS3 to be static, meaning it belongs to the Query5kPids class, not to a particular instance. Like this:

    static List<String> ListFromS3 = new ArrayList<String>();