I want to convert a Spark DataFrame to a Dataset of a POJO with different field names. The DataFrame has the columns name and date_of_birth, with types StringType and DateType respectively.
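For a reproducible setup, such a DataFrame could be built like this (the SparkSession and the sample row are made up for illustration):
import java.sql.Date;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

// Schema matching the description above: name (StringType), date_of_birth (DateType).
StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("date_of_birth", DataTypes.DateType);

Dataset<Row> result = spark.createDataFrame(
        List.of(RowFactory.create("Ada", Date.valueOf("1815-12-10"))),
        schema);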
And the following POJO:
public class Person implements Serializable {
    private String name;
    private Date dateOfBirth; // java.sql.Date, which maps to Spark's DateType

    // Bean-style getters and setters omitted; Encoders.bean requires them.
}
I can convert it to a Dataset successfully with the following code:
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDS = result.as(personEncoder);
List<Person> personList = personDS.collectAsList();
However, this works only if I first rename the DataFrame's columns to match the Person POJO's field names. Is there any way of telling Spark to map the fields from the POJO side?
I thought about Gson's @SerializedName("date_of_birth") annotation, but it didn't affect anything.
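For reference, the renaming workaround I currently use looks roughly like this ("name" already matches the POJO field, so only date_of_birth needs renaming):
// Rename the mismatched column so Encoders.bean can resolve it by name.
Dataset<Person> personDS = result
        .withColumnRenamed("date_of_birth", "dateOfBirth")
        .as(personEncoder);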
As far as I know, Encoders.bean matches columns to bean properties strictly by name, so the mapping has to happen on the DataFrame side. If you have a name mapping, say in a Map, you can use it to rename the columns before converting the DataFrame into a Dataset.
It could be written like this:
// Required imports for the snippet below.
import static org.apache.spark.sql.functions.col;

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Column;

// Build the mapping from DataFrame column names to POJO property names.
// Here it is created inline, but it could be read from a config file, for instance.
Map<String, String> nameMapping = new HashMap<>();
nameMapping.put("name", "name");
nameMapping.put("date_of_birth", "dateOfBirth");

// Alias each source column to its target bean property name.
Column[] renamedColumns = nameMapping
        .entrySet()
        .stream()
        .map(e -> col(e.getKey()).alias(e.getValue()))
        .toArray(Column[]::new);

Dataset<Person> personDS = result.select(renamedColumns).as(personEncoder);
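Alternatively, the same Map can drive a chain of withColumnRenamed calls instead of a select. This is just a sketch, but it keeps any columns that aren't in the mapping (withColumnRenamed is a no-op for names that don't exist):
// Fold the mapping over the DataFrame, renaming one column per entry.
Dataset<Row> renamed = result;
for (Map.Entry<String, String> entry : nameMapping.entrySet()) {
    renamed = renamed.withColumnRenamed(entry.getKey(), entry.getValue());
}
Dataset<Person> personDS = renamed.as(personEncoder);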