Tags: apache-spark, pyspark, jdbc, apache-spark-sql, view

Spark SQL persistent view over jdbc data source


I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.

I don't understand why the latter should not work, and I couldn't find any docs or hints on it, as all examples use temporary views. Technically, I cannot see why it shouldn't work either: the data is properly retrieved from the jdbc source in the temporary view, so it should not matter if I wanted to "store" the query in a persistent view so that every call to the view retrieves the data directly from the jdbc source.

Config.

tbl_in = 'myjdbctable'
tbl_out = 'myview'
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'

This works.

query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
  )
"""

spark.sql(query)

> DataFrame[]
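
The temporary view can then be queried like any other view, with each read going against the jdbc source. A minimal check (no specific columns assumed) looks like this:

# Read through the temporary view; the jdbc source is queried on read.
df = spark.sql(f"select * from {tbl_out}")
df.printSchema()
df.show(5)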

This does not work.

query = f"""
create or replace view {tbl_out}
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
  )
"""

spark.sql(query)

> ParseException:

Error.

ParseException: 
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)

== SQL ==

create or replace view myview
using jdbc
^^^
options(
  dbtable 'myjdbctable',
  user 'myuser',
  password '[REDACTED]',
  url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
  )

Solution

  • TL;DR: A spark sql table over a jdbc source behaves like a view and so can be used like one.

    It seems my assumptions about jdbc tables in spark sql were flawed. It turns out that a sql table with a jdbc source (i.e. created via using jdbc) is actually a live query against the jdbc source (and not a one-off jdbc query at table creation time, as I had assumed). In my mind it then effectively behaves like a view: if the underlying jdbc source changes (e.g. new entries in a column), this is reflected in the spark sql table on read (e.g. select from) without having to re-create the table (see the sketch after the list below).

    It follows that the spark sql table over a jdbc source satisfies my requirement of having an always up-to-date reflection of the underlying table/sql object in the jdbc source. Usually I would use a view for that. Maybe this is the reason why there is no persistent view over a jdbc source, only temporary views (which of course still make sense as they are session-scoped). It should be noted, however, that the spark sql jdbc table does not behave like a view in every respect, which may be surprising, in particular:

    • if you add a column to the underlying jdbc table, it will not show up in the spark sql table
    • if you remove a column from the underlying jdbc table, an error will occur when the spark sql table is accessed (assuming the removed column was present when the spark sql table was created)
    • if you remove the underlying jdbc table, an error will occur when the spark sql table is accessed
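
For illustration, here is a minimal sketch of such a persistent table over the jdbc source, re-using the config from the question (the name tbl_out is kept for consistency, even though it now denotes a table rather than a view):

# Persistent table backed by the jdbc source; reads hit the source live.
query = f"""
create table if not exists {tbl_out}
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
  )
"""

spark.sql(query)

# Subsequent reads reflect the current state of the underlying jdbc table,
# e.g. rows inserted after the table was created show up here.
spark.sql(f"select * from {tbl_out}").show(5)

The only change compared to the failing statement is that create or replace view becomes create table if not exists; the using jdbc / options clause is accepted for tables, and the resulting table is queried live, which is exactly the view-like behaviour described above.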