How to use strsplit on SparkDataFrame


I am taking my first steps in the Azure Databricks world and therefore I have to learn how to use SparkR.

[I am coming from data.table]

Although I have read a lot of documentation, I think I am missing something about SparkDataFrame.

To create a new column, I learned that we can do something like:

sdf$new <- sdf$old * 0.5

But when I try to use a basic R function, I get an error and I can't figure out why:

sdf <- sql("select * from database.table")
sdf$new <- strsplit(sdf$old, "-")[1]

Error in strsplit((sdf$old), "-") : 
  non-character argument

What am I missing?

Thanks.


Solution

  • strsplit fails because sdf$old is a Spark Column expression, not an R character vector. Instead of strsplit, use the Spark-specific functions listed in the SparkR API documentation. Specifically, use the split_string function combined with getItem (note the L suffix, which forces the index to be an integer):

    new_df <- withColumn(sdf, "new_id", getItem(split_string(sdf$old, "-"), 0L))
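
For anyone who wants to try this without access to the original table, here is a minimal sketch of the same approach on a small made-up SparkDataFrame. The sample values and the local data frame are assumptions for illustration only; the column names old and new follow the question. It assumes a Spark installation is available (on Databricks the session already exists and sparkR.session() simply reuses it).

```r
library(SparkR)

# Start or reuse a Spark session (on Databricks one is already running).
sparkR.session()

# Hypothetical sample data standing in for database.table.
local_df <- data.frame(old = c("a-b-c", "x-y"), stringsAsFactors = FALSE)
sdf <- createDataFrame(local_df)

# split_string() takes a Column and a regex pattern and returns an array Column.
parts <- split_string(sdf$old, "-")

# getItem() extracts one element of that array; 0L is an integer index
# pointing at the first element.
sdf <- withColumn(sdf, "new", getItem(parts, 0L))

# Bring the result back to a local R data.frame to inspect it.
result <- collect(sdf)
print(result)
```

Since the right-hand side is an ordinary Column expression, the assignment form from the question should work as well: sdf$new <- getItem(split_string(sdf$old, "-"), 0L).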