
Convert a Spark DF to a DS with different field names


I want to convert a Spark DataFrame to a Dataset of a POJO whose field names differ from the DataFrame's column names. I have a DataFrame with the columns name and date_of_birth, of types StringType and DateType.
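
For illustration, such a DataFrame could be built like this (a sketch; the SparkSession variable spark is an assumption):

// A minimal DataFrame matching the description above (hypothetical setup)
Dataset<Row> result = spark.sql(
        "SELECT 'Alice' AS name, CAST('1990-01-01' AS DATE) AS date_of_birth");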

And a POJO of:

public class Person implements Serializable {
    private String name;
    private Date dateOfBirth; // java.sql.Date, which Spark maps to DateType
    // public getters and setters are required by Encoders.bean (omitted for brevity)
}

I can convert it to a Dataset successfully with the following code:

Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDS = result.as(personEncoder);
List<Person> personList = personDS.collectAsList();

This works only if I first rename the DataFrame's columns to those of the Person POJO. Is there any way of telling Spark how to map the columns to the fields from the POJO side?

I thought about Gson's @SerializedName("date_of_birth"), but it didn't affect anything.


Solution

  • If you have a name mapping, say in a Map, you can use it to rename the columns before converting the DataFrame into a Dataset.

    It could be written like this:

    // Build the mapping here; it could also be read from a config file, for instance
    Map<String, String> nameMapping = new java.util.HashMap<>();
    nameMapping.put("name", "name");
    nameMapping.put("date_of_birth", "dateOfBirth");
    
    // Requires: import static org.apache.spark.sql.functions.col;
    Column[] renamedColumns = nameMapping
                    .entrySet()
                    .stream()
                    .map(x -> col(x.getKey()).alias(x.getValue()))
                    .toArray(Column[]::new);
    
    Dataset<Person> personDS = result.select(renamedColumns).as(personEncoder);
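
  • Alternatively, if only a column or two differs (here, name already matches the POJO), a minimal sketch using Spark's built-in withColumnRenamed, reusing result and personEncoder from above:

    // Rename just the mismatched column, then apply the bean encoder
    Dataset<Person> personDS = result
            .withColumnRenamed("date_of_birth", "dateOfBirth")
            .as(personEncoder);

    Columns that aren't mentioned keep their names, so no mapping is needed for columns that already line up.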