
compute string length in Spark SQL DSL


Edit: this is an old question concerning Spark 1.2

I've been trying to compute the length of a string column in a SchemaRDD on the fly, for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, or about knowing their limitations.

My first attempt has been to use the integrated relational queries, for instance

notes.select('note).orderBy(length('note))

which fails at compilation with:

error: not found: value length

(Which makes me wonder where to find out which "Expressions" this DSL can actually resolve. For instance, it resolves "+" for column additions.)

Then I tried

sql("SELECT note, length(note) as len FROM notes")

This fails with

java.util.NoSuchElementException: key not found: length

(Then I reread http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features (I'm running 1.2.0) and wonder in what sense Spark SQL supports the listed Hive features.)

Questions: is the length operator actually supported in Expressions and/or in SQL statements? If so, what is the syntax? (Bonus: is there specific documentation about what is resolved in Spark SQL Expressions, and what the syntax is in general?)
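
A workaround I could imagine, assuming the registerFunction API that SQLContext exposes in Spark 1.x (the function name len below is just a sketch), would be to register the length computation as a UDF myself, though I would still prefer a built-in if one exists:

    // hypothetical workaround, assuming the notes table is registered on sqlContext
    sqlContext.registerFunction("len", (s: String) => s.length)
    sqlContext.sql("SELECT note, len(note) AS len FROM notes ORDER BY len")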

Thanks!


Solution

  • Try this in the Spark shell:

    // case class describing the schema of the table
    case class Note(id: Int, text: String)
    val notes = List(Note(1, "One"), Note(2, "Two"), Note(3, "Three"))
    val notesRdd = sc.parallelize(notes)

    import org.apache.spark.sql.hive.HiveContext
    val hc = new HiveContext(sc)
    // implicit conversion from RDD[Note] to a SchemaRDD
    import hc.createSchemaRDD

    notesRdd.registerTempTable("note")
    // length is resolved as a Hive UDF by the HiveContext
    hc.sql("select id, text, length(text) from note").foreach(println)
    

    It works on my setup (out-of-the-box Spark 1.2.1 with Hadoop 2.4):

    [2,Two,3]
    [1,One,3]
    [3,Three,5]
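
    For the original orderBy use case, the same Hive length function should also work directly in the ORDER BY clause (same setup as above):

    // order the rows by the length of the text column
    hc.sql("select id, text from note order by length(text)").foreach(println)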