Tags: java, apache-spark, amazon-s3, apache-spark-sql, hadoop-yarn

Spark - Remove column from bean before writing in partitions


I have a PersonBean which has City, Bday, and MetadataJson member variables.

I want to write the data partitioned by bday and city. Partitioning by City and bday can each be toggled on/off.

Everything works fine when I partition by both bday and city together; I can write MetadataJson in text format.

But when, say, City is toggled off, City is blank in my PersonBean (as expected), so I get an error:

org.apache.spark.sql.AnalysisException: Text data source supports only a single column, and you have 2 columns.;

When I write the same dataset in CSV format, it writes a blank second column. Is there a way to drop the column when writing in "text" format?

I don't want to create three separate beans, one for each combination of partition columns in my expected format:

Bean 1: bday and MetadataJson
Bean 2: City and MetadataJson
Bean 3: bday, City, and MetadataJson




import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

JavaRDD<PersonBean> rowsrdd = jsc.parallelize(dataList);
SparkSession spark = new SparkSession(JavaSparkContext.toSparkContext(jsc));

// Encoders.bean produces a typed Dataset<PersonBean>
Dataset<PersonBean> beanDataset = spark.createDataset(rowsrdd.rdd(), Encoders.bean(PersonBean.class));
String[] partitionColumns = new String[]{"City"};

beanDataset.write()
        .partitionBy(partitionColumns)
        .mode(SaveMode.Append)
        .option("escape", "")
        .option("quote", "")
        .format("text")
        .save("outputpath");

Solution

  • I used a beanDataset.select("bday", "MetadataJson") call before writing the dataset. This way I can use the same bean for different combinations of partitioning columns.
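
A minimal sketch of that approach, building the column list from the same toggles that control partitioning (the partitionByBday/partitionByCity flags are hypothetical stand-ins for whatever configuration drives the toggles; beanDataset is the dataset from the question):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.functions;

// Hypothetical toggles; in a real job these would come from configuration.
boolean partitionByBday = true;
boolean partitionByCity = false;

// Collect only the active partition columns, with the payload column last.
List<String> partitionCols = new ArrayList<>();
if (partitionByBday) partitionCols.add("bday");
if (partitionByCity) partitionCols.add("City");

List<String> selectedCols = new ArrayList<>(partitionCols);
selectedCols.add("MetadataJson");

// select() drops the toggled-off column, so after partitionBy() strips the
// partition columns, the text source sees exactly one column: MetadataJson.
Column[] cols = selectedCols.stream().map(functions::col).toArray(Column[]::new);
Dataset<Row> trimmed = beanDataset.select(cols);

trimmed.write()
        .partitionBy(partitionCols.toArray(new String[0]))
        .mode(SaveMode.Append)
        .format("text")
        .save("outputpath");

Since partitionBy() removes the partition columns from the written file contents, the select leaves the text writer with exactly the single MetadataJson column it requires, whichever toggles are on, and the AnalysisException goes away.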