I am using SparkR and want to use the substr()
command to isolate the last character of a string that is contained in a column. I can get substr()
to work if I set the StartPosition and EndPosition to a constant:
substr(sdfIris$Species, 8, 8)
But when I try to set these parameters using a value sourced from the DataFrame:
sdfIris <- createDataFrame(sqlContext, iris)
sdfIris$Len <- length(sdfIris$Species)
sdfIris$Last <- substr(sdfIris$Species, sdfIris$Len, sdfIris$Len)
Error in as.integer(start - 1) : cannot coerce type 'S4' to vector of type 'integer'
It seems that the result being returned from sdfIris$Len
is perhaps a one-cell DataFrame, and the parameter needs an integer.
I have tried collect(sdfIris$Len)
, but:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"Column"’
This seems incongruous. substr()
seems to see sdfIris$Len
as a DataFrame, but collect()
seems to see it as a Column.
I have already identified a work-around by using registerTempTable
and using SparkSQL's substr to isolate the last character, but I was hoping to avoid the unnecessary steps of switching to SQL.
How can I use SparkR substr()
on a DataFrame column with dynamic Start and Finish parameters?
It is not optimal but you can use expr
:
df <- createDataFrame(
sqlContext,
data.frame(s=c("foo", "bar", "foobar"), from=c(1, 2, 0), to=c(2, 3, 5))
)
select(df, expr("substr(s, from, to)")) %>% head()
## substr(s,from,to)
## 1 fo
## 2 ar
## 3 fooba
or selectExpr
:
selectExpr(df, "substr(s, from, to)") %>% head()
## substr(s,from,to)
## 1 fo
## 2 ar
## 3 fooba
as well as equivalent SQL query.