Search code examples
apache-sparkimpalaapache-kudu

SPARK KUDU Complex Update statements directly or via Impala JDBC Driver possible?


If I look at the Imapala Shell or Hue, I can write complicated enough IMPALA update statements for KUDU. E.g. update with sub-select and what not. Fine.

Looking at the old JDBC connection methods for, say, mySQL via SPARK / SCALA, there is not a lot of possibility to do a complicated update via such a connection, and that is understandable. However, with KUDU, I think the situation changes.

Looking at the documentation on KUDU - Apache KUDU - Developing Applications with Apache KUDU, the follwoing questions:

  1. It is unclear if I can issue a complex update SQL statement from a SPARK / SCALA environment via an IMPALA JDBC Driver (due to security issues with KUDU).
  2. In SPARK KUDU Native Mode DML seems tied to a Data Frame approach with INSERT and UPSERT. What if I just want to write a free-format SQL DML statement like an UPDATE? I see that we can use Spark SQL to INSERT (treated as UPSERT by default) into a Kudu table. E.g.

    sqlContext.sql(s"INSERT INTO TABLE $kuduTableName SELECT * FROM source_table")
    
  3. My understanding with SPARK SQL INSERT ... as per above is that the KUDU table must be a temporary table as well. I cannot approach it directly. So, taking this all in then, how can we approach a KUDU table directly in SPARK? We cannot in SPARK / KUDU, and complicated UPDATES statement via SPARK SCALA / KUDU or SPARK SCALA to KUDU via Impala JDBC connection do not allow this either. I can do some things via shell scripting with saved env vars in some cases I note.

Solution

  • What a bad documentation in this regard.

    DML insert, update, ... possible via the "approach" below, some examples:

    stmt.execute("update KUDU_1 set v = 'same value' where k in ('1', '4')  ;") 
    stmt.execute("insert into KUDU_1 select concat(k, 'ABCDEF'), 'MASS INSERT' from KUDU_1  ;") 
    

    The only thing if using the corresponding stmt.executequery a Java resultset is returned which differs to the more standard approach of reading from JDBC sources and persisting the results. A little surprise here for me. Maybe 2 approaches needed, one for more regular selects and one work DML non-select. Not sure if that can be all in the same programme module. For another time. Yes it can.