apache-spark, truncate, apache-kudu

Truncate Kudu table using Spark


What is the best way to truncate a Kudu table from Spark? Is there an analogue of the SQL statements "TRUNCATE TABLE_NAME;" or "DELETE FROM TABLE_NAME;"?

I only managed to find kuduContext.deleteRows, but it requires the rows to delete to be specified explicitly.

Or should I use KuduClient rather than Spark for such operations?


Solution

  • I couldn't find a truncate operation in KuduClient. When deleting rows from Kudu, the ids (primary keys) of the rows have to be specified explicitly.

    The easiest method (with the shortest code), as mentioned in the documentation, is to read the id column (or all the primary key columns) as a DataFrame and pass it to KuduContext.deleteRows.

    import org.apache.kudu.spark.kudu._
    
    val kuduMasters = Seq("kudu_ubuntu:7051").mkString(",")
    val tableName = "test_tbl"
    val kuduContext = new KuduContext(kuduMasters, sc)
    val df = spark.sqlContext.read.
        options(Map("kudu.master" -> kuduMasters,
                     "kudu.table" -> tableName)).
        kudu
    val idToDelete = df.select("no")                // "no" is the id (key) column here; holds the ids of all existing rows
    kuduContext.deleteRows(idToDelete, tableName)   // delete every row whose id was read above
    

    Note: I used Spark 2 with the package org.apache.kudu:kudu-spark2_2.11:1.6.0 for the Kudu connection.
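
If the table has a composite primary key, or you don't want to hard-code the key column name, the same approach can be generalized by reading the key column names from the table schema. The following is only a sketch under some assumptions: deleteAllRows is a hypothetical helper name (not part of kudu-spark), it relies on KuduContext exposing its underlying client as syncClient, and it has only been written against the kudu-spark2 1.6.0 API mentioned above.

```scala
import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

// Hypothetical helper (not provided by kudu-spark): removes all rows of a Kudu table
// by reading its primary key columns and passing them to KuduContext.deleteRows.
def deleteAllRows(spark: SparkSession,
                  kuduContext: KuduContext,
                  kuduMasters: String,
                  tableName: String): Unit = {
  // Look up the names of the primary key columns from the table schema.
  val keyColumns = kuduContext.syncClient
    .openTable(tableName)
    .getSchema
    .getPrimaryKeyColumns
    .asScala
    .map(_.getName)

  // Read only the key columns; deleteRows needs the full primary key of each row.
  val keys = spark.read
    .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> tableName))
    .kudu
    .select(keyColumns.head, keyColumns.tail: _*)

  // Issue one delete per key that was read.
  kuduContext.deleteRows(keys, tableName)
}
```

With the values from the snippet above, it would be called as deleteAllRows(spark, kuduContext, kuduMasters, tableName). This is still a row-by-row delete rather than a true truncate, so rows inserted while the job is running may survive.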