
Convert a Spark DF to a DS with different field names


I want to convert a Spark DataFrame to a Dataset of a POJO whose field names differ from the DataFrame's column names. I have a DataFrame with the columns name and date_of_birth, of types StringType and DateType.
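
For illustration, such a DataFrame could be built like this (a sketch; the SparkSession variable spark is an assumption):

// A minimal DataFrame matching the description above (hypothetical setup)
Dataset<Row> result = spark.sql(
        "SELECT 'Alice' AS name, CAST('1990-01-01' AS DATE) AS date_of_birth");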

And a POJO of:

public class Person implements Serializable {
    private String name;
    private Date dateOfBirth; // java.sql.Date, which Spark maps to DateType
    // public getters and setters are required by Encoders.bean (omitted for brevity)
}

I can convert it to a Dataset successfully with the following code:

Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDS = result.as(personEncoder);
List<Person> personList = personDS.collectAsList();

This works only if I first rename the DataFrame's columns to those of the Person POJO. Is there any way of telling Spark how to map the columns to the fields from the POJO side?

I thought about Gson's @SerializedName("date_of_birth"), but it didn't affect anything.


Solution

  • If you have a name mapping, say in a Map, you can use it to rename the columns before converting the DataFrame into a Dataset.

    It could be written like this:

    // Build the mapping here; it could also be read from a config file, for instance
    Map<String, String> nameMapping = new java.util.HashMap<>();
    nameMapping.put("name", "name");
    nameMapping.put("date_of_birth", "dateOfBirth");
    
    // Requires: import static org.apache.spark.sql.functions.col;
    Column[] renamedColumns = nameMapping
                    .entrySet()
                    .stream()
                    .map(x -> col(x.getKey()).alias(x.getValue()))
                    .toArray(Column[]::new);
    
    Dataset<Person> personDS = result.select(renamedColumns).as(personEncoder);
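
  • Alternatively, if only a column or two differs (here, name already matches the POJO), a minimal sketch using Spark's built-in withColumnRenamed, reusing result and personEncoder from above:

    // Rename just the mismatched column, then apply the bean encoder
    Dataset<Person> personDS = result
            .withColumnRenamed("date_of_birth", "dateOfBirth")
            .as(personEncoder);

    Columns that aren't mentioned keep their names, so no mapping is needed for columns that already line up.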