Tags: r, apache-spark, sparkr

How to edit the schema of a SparkDataFrame?


I have a SparkDataFrame on which I want to apply some function using dapply() and add a new column.

dapply() in SparkR expects a schema that matches the output of the applied function. For example:

# Creating a SparkDataFrame (as.DataFrame replaces the "." in the iris column names with "_")
sdf <- as.DataFrame(iris)

# Defining the output schema: every original column plus the new one
schm <- structType(
  structField("Sepal_Length", "double"),
  structField("Sepal_Width", "double"),
  structField("Petal_Length", "double"),
  structField("Petal_Width", "double"),
  structField("Species", "string"),
  structField("Specie_new", "string")
)

# dapply call that adds the last two characters of Species as a new column
sdf2 <- dapply(sdf, function(y) {
  y$Specie_new <- substr(y$Species, nchar(y$Species) - 1, nchar(y$Species))
  return(y)
}, schm)

Is there a better way to do this? If I had 100 columns, writing the schema out by hand would not be feasible. What should I do in such cases?


Solution

  • Arguably a better way is to avoid dapply for simple cases like this one. You can use a simple regexp to achieve the same result:

    regexp_extract(df$Species, "^.*(.{2})$", 1)
    

    or a combination of Spark SQL functions (SparkR::substr, SparkR::length).
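
    For instance, a minimal dapply-free sketch (assuming the goal is simply to attach the derived column to the same SparkDataFrame, here still called df):

    # Add the last two characters of Species as a new column, without dapply
    df$Specie_new <- regexp_extract(df$Species, "^.*(.{2})$", 1)
    head(select(df, "Species", "Specie_new"))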

    Still, you can easily reuse an existing schema to create a new one. Assuming you want to add a new field foo:

    foo <- structField("foo", "string")
    

    just extract the fields of the existing schema and combine them with the new one:

    do.call(structType, c(schema(df)$fields(), list(foo)))
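
    For example, tying this back to the question (a sketch only; new_field and new_schm are hypothetical names, and sdf is the SparkDataFrame built from iris above):

    # Reuse the existing schema of sdf and append the extra field
    new_field <- structField("Specie_new", "string")
    new_schm  <- do.call(structType, c(schema(sdf)$fields(), list(new_field)))

    # Same dapply as in the question, but with the derived schema
    sdf2 <- dapply(sdf, function(y) {
      y$Specie_new <- substr(y$Species, nchar(y$Species) - 1, nchar(y$Species))
      return(y)
    }, new_schm)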