
Java Spark Dataset MapFunction - Task not serializable without any reference to class


I have the following class that reads CSV data into a Spark Dataset. Everything works fine if I simply read and return the data.

However, if I apply a MapFunction to the data before returning from the function, I get:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

Caused by: java.io.NotSerializableException: com.Workflow.

I know how Spark works and that it needs to serialize objects for distributed processing. However, I'm NOT using any reference to the Workflow class in my mapping logic, and I'm not calling any Workflow method from it. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class Workflow {

    private final SparkSession spark;
    private final String dataPath;

    public Workflow(SparkSession spark, String dataPath) {
        this.spark = spark;
        this.dataPath = dataPath;
    }

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
            .add("text", "string", false)
            .add("category", "string", false);

        Dataset<Row> data = spark.read()
            .schema(schema)
            .csv(dataPath);

        /*
         * works fine till here if I call
         * return data;
         */

        Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
            @Override
            public Row call(Row row) {
                /* some mapping logic */
                return row;
            }
        }, RowEncoder.apply(schema));

        cleanedData.printSchema();
        /* .... ERROR .... */
        cleanedData.show();

        return cleanedData;
    }
}

Solution

  • Anonymous inner classes hold a hidden, implicit reference to their enclosing class, so serializing the MapFunction drags the entire Workflow instance (including its non-serializable SparkSession) along with it. Use a lambda expression instead (see the sketch below), or go with Roma Anankin's solution.
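
For reference, here is a minimal sketch of the lambda-based fix, reusing the question's own fields, schema, and imports (nothing new is assumed beyond those). A Java lambda captures only the effectively final locals its body actually references; since the body below touches nothing from `this`, Spark no longer tries to serialize Workflow:

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
            .add("text", "string", false)
            .add("category", "string", false);

        Dataset<Row> data = spark.read()
            .schema(schema)
            .csv(dataPath);

        // The cast selects the MapFunction overload of map(); the lambda
        // references only its parameter `row`, so no reference to the
        // enclosing Workflow instance is captured or serialized.
        Dataset<Row> cleanedData = data.map(
            (MapFunction<Row, Row>) row -> {
                /* some mapping logic */
                return row;
            },
            RowEncoder.apply(schema));

        cleanedData.printSchema();
        cleanedData.show();   // no Task not serializable here

        return cleanedData;
    }

A static nested class implementing MapFunction<Row, Row> works for the same reason: MapFunction already extends Serializable, and a static class carries no implicit pointer to the outer Workflow instance.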