Tags: databricks, spark-structured-streaming

Data source V2 streaming is not supported on table acl or credential passthrough clusters


I'm using Databricks, and this PySpark code:

kafka = spark.readStream\
    .format("kafka")\
    .option("kafka.sasl.mechanism", "SCRAM-SHA-512")\
    .option("kafka.security.protocol", "SASL_SSL")\
    .option("kafka.sasl.jaas.config", f'org.apache.kafka.common.security.scram.ScramLoginModule required  username="{user_stg}" password="{pass_stg}"')\
    .option("kafka.bootstrap.servers", "b-1.dataservices-msk-st.****.amazonaws.com:9096")\
    .option("subscribe", "app-***-events")\
    .option("startingOffsets", "earliest").load()

returns this error:

     java.lang.SecurityException: Data source V2 streaming is not supported on table acl or credential passthrough clusters. 
     StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@11002bae, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@35ae434, 
     [kafka.sasl.mechanism=SCRAM-SHA-512, subscribe=app--events, kafka.sasl.jaas.config=*********(redacted), kafka.bootstrap.servers=b-1.dataservices-msk-st.****.amazonaws.com:9096, startingOffs

What is going on? Is there any option to fix it?


Solution

  • Unfortunately, clusters with Table Access Control Lists (TACL) enabled have limitations on what can be executed on them. Some of these limitations are explicitly listed in the documentation, but in general you are limited to performing transformations with Spark APIs on data obtained by reading tables registered in the Hive metastore; the Kafka streaming source (a Data Source V2 source) is one of the things that is blocked, which is exactly what the error message says. Clusters with TACL, like Databricks SQL, are usually used for data analysis, not for ETL jobs.

    Alternatives would be:

    • Run your ETL on a cluster without TACL - it is standard practice to run ETL on a separate cluster and produce tables that are then consumed by users; see the sketch after this list
    • Switch to Unity Catalog, which has fewer restrictions than TACL
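
    As an illustration of the first alternative, here is a minimal sketch: the same Kafka read runs on a regular (non-TACL) job cluster and lands the stream in a Delta table, which users on the restricted cluster can then query as an ordinary table. The broker address, topic, checkpoint path, and table name (bronze.app_events) are placeholders, not values from your setup:

        # Runs on a cluster WITHOUT table ACLs / credential passthrough,
        # where the Data Source V2 Kafka streaming source is allowed.
        kafka = (spark.readStream
            .format("kafka")
            .option("kafka.sasl.mechanism", "SCRAM-SHA-512")
            .option("kafka.security.protocol", "SASL_SSL")
            .option("kafka.sasl.jaas.config",
                    f'org.apache.kafka.common.security.scram.ScramLoginModule required username="{user_stg}" password="{pass_stg}"')
            .option("kafka.bootstrap.servers", "<broker>:9096")  # placeholder
            .option("subscribe", "<topic>")                      # placeholder
            .option("startingOffsets", "earliest")
            .load())

        # Land the raw events in a Delta table registered in the metastore;
        # the checkpoint location and table name are placeholders.
        (kafka.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
            .writeStream
            .format("delta")
            .option("checkpointLocation", "/mnt/checkpoints/app_events")
            .toTable("bronze.app_events"))

        # On the TACL cluster, users then read the result as a normal table:
        #   df = spark.table("bronze.app_events")

    With this split, the restricted cluster never talks to Kafka directly, so it never needs Data Source V2 streaming at all.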