I have a SparkDataFrame to which I want to apply a function using dapply() and add a new column.
dapply in SparkR expects a schema that matches the output of the applied function, e.g.:
# Creating a SparkDataFrame
sdf <- as.DataFrame(iris)

# Initiating the schema: all input columns plus the new one
schm <- structType(
  structField("Sepal_Length", "double"),
  structField("Sepal_Width", "double"),
  structField("Petal_Length", "double"),
  structField("Petal_Width", "double"),
  structField("Species", "string"),
  structField("Specie_new", "string")
)

# dapply code: take the last two characters of Species
sdf2 <- dapply(sdf, function(y) {
  y$Specie_new <- substr(y$Species, nchar(y$Species) - 1, nchar(y$Species))
  return(y)
}, schm)
Is there a better way to do this? If I had 100 columns, writing out the schema by hand would not be feasible. What should I do in such cases?
Arguably a better way is to avoid dapply for simple cases like this one. You can use a simple regexp to achieve the same result:

regexp_extract(sdf$Species, "^.*(.{2})$", 1)
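
For example, a minimal sketch of attaching the new column with that expression (the column name Specie_new follows the question; this runs entirely as Spark SQL, no R worker round-trip and no schema needed):

# Capture the last two characters of Species into a new column
sdf$Specie_new <- regexp_extract(sdf$Species, "^.*(.{2})$", 1)
head(sdf)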
or a combination of Spark SQL functions (SparkR::substr, SparkR::length).
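
One way to combine them, as a sketch: since the R-level substr wrapper expects integer positions rather than Columns, you can fall back to SparkR::expr and write the equivalent Spark SQL expression directly:

# substring/length evaluated by Spark SQL; position length-1 with width 2
# yields the last two characters of Species
sdf$Specie_new <- expr("substring(Species, length(Species) - 1, 2)")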
Still, you can easily reuse an existing schema to create a new one. Assuming you want to add a new field foo:
foo <- structField("foo", "string")
just extract the fields of the existing schema and combine them:
do.call(structType, c(schema(sdf)$fields(), list(foo)))
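
Putting it together with the question's example (a sketch; here the new field plays the role of Specie_new rather than foo):

# Build the output schema from the input schema plus the new field,
# instead of spelling out every existing column by hand
new_field <- structField("Specie_new", "string")
schm2 <- do.call(structType, c(schema(sdf)$fields(), list(new_field)))

sdf2 <- dapply(sdf, function(y) {
  y$Specie_new <- substr(y$Species, nchar(y$Species) - 1, nchar(y$Species))
  y
}, schm2)

This scales to any number of columns, since the existing fields are copied programmatically.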