Search code examples
rapache-sparkdataframesubstrsparkr

Use of substr() on DataFrame column in SparkR


I am using SparkR and want to use the substr() command to isolate the last character of a string that is contained in a column. I can get substr() to work if I set the StartPosition and EndPosition to a constant:

substr(sdfIris$Species, 8, 8)

But when I try to set these parameters using a value sourced from the DataFrame:

sdfIris <- createDataFrame(sqlContext, iris)
sdfIris$Len <- length(sdfIris$Species)
sdfIris$Last <- substr(sdfIris$Species, sdfIris$Len, sdfIris$Len)

Error in as.integer(start - 1) : cannot coerce type 'S4' to vector of type 'integer'

It seems that the result being returned from sdfIris$Len is perhaps a one-cell DataFrame, and the parameter needs an integer.

I have tried collect(sdfIris$Len), but:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"Column"’

This seems incongruous. substr() seems to see sdfIris$Len as a DataFrame, but collect() seems to see it as a Column.

I have already identified a work-around by using registerTempTable and using SparkSQL's substr to isolate the last character, but I was hoping to avoid the unnecessary steps of switching to SQL.

How can I use SparkR substr() on a DataFrame column with dynamic Start and Finish parameters?


Solution

  • It is not optimal but you can use expr:

    df <- createDataFrame(
      sqlContext,
      data.frame(s=c("foo", "bar", "foobar"), from=c(1, 2, 0), to=c(2, 3, 5))
    )
    
    select(df, expr("substr(s, from, to)")) %>% head()
    
    ##   substr(s,from,to)
    ## 1                fo
    ## 2                ar
    ## 3             fooba
    

    or selectExpr:

    selectExpr(df, "substr(s, from, to)") %>% head()
    
    ##   substr(s,from,to)
    ## 1                fo
    ## 2                ar
    ## 3             fooba
    

    as well as equivalent SQL query.