I'm using Databricks, and this PySpark code:
kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "SCRAM-SHA-512") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", f'org.apache.kafka.common.security.scram.ScramLoginModule required username="{user_stg}" password="{pass_stg}";') \
    .option("kafka.bootstrap.servers", "b-1.dataservices-msk-st.****.amazonaws.com:9096") \
    .option("subscribe", "app-***-events") \
    .option("startingOffsets", "earliest") \
    .load()
returns this error:
java.lang.SecurityException: Data source V2 streaming is not supported on table acl or credential passthrough clusters.
StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@11002bae, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@35ae434,
[kafka.sasl.mechanism=SCRAM-SHA-512, subscribe=app-***-events, kafka.sasl.jaas.config=*********(redacted), kafka.bootstrap.servers=b-1.dataservices-msk-st.****.amazonaws.com:9096, startingOffs
What is going on? Is there any way to fix it?
Unfortunately, clusters with Table Access Control Lists (TACL) enabled have limitations on what can be executed on them. Some of these limitations are explicitly listed in the documentation, but in general you're limited to performing transformations with the Spark APIs on data obtained by reading tables registered in the Hive metastore. Direct use of a V2 streaming source such as Kafka is blocked, which is exactly what the SecurityException above is telling you. Clusters with TACL enabled (or Databricks SQL) are usually meant for data analysis, not for ETL jobs.
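For context, here is a minimal sketch of the kind of work that does run on a TACL-enabled cluster: reading a table registered in the metastore and transforming it with the DataFrame API. The table and column names (events.app_events, timestamp) are made-up placeholders:

    from pyspark.sql import functions as F

    # Reading a registered table goes through the table ACLs, so it works
    # on a TACL cluster ("events.app_events" is a hypothetical table name).
    df = spark.table("events.app_events")

    # Plain DataFrame transformations on that data are allowed as well
    # (the "timestamp" column is assumed to exist in the table).
    daily_counts = (df
        .groupBy(F.to_date("timestamp").alias("day"))
        .count())

    display(daily_counts)  # Databricks notebook display helper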
Alternatives would be:

- Run the Kafka ingestion on a separate cluster (interactive or job cluster) that has neither table ACLs nor credential passthrough enabled, write the stream into a table registered in the metastore, and read that table from the TACL-enabled cluster; see the sketch below.
- If you don't actually need table ACLs for this workload, simply run the whole job on a regular cluster without those features enabled.
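A minimal sketch of the first alternative, assuming it runs on a regular cluster with neither table ACLs nor credential passthrough enabled. It reuses the Kafka options from the question; the checkpoint path and target table name (/mnt/checkpoints/app_events, raw.app_events) are made-up placeholders:

    # Run this on a cluster WITHOUT table ACLs or credential passthrough.
    # user_stg / pass_stg as in the question (e.g. from dbutils.secrets.get).
    kafka = (spark.readStream
        .format("kafka")
        .option("kafka.sasl.mechanism", "SCRAM-SHA-512")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config",
                f'org.apache.kafka.common.security.scram.ScramLoginModule required '
                f'username="{user_stg}" password="{pass_stg}";')
        .option("kafka.bootstrap.servers",
                "b-1.dataservices-msk-st.****.amazonaws.com:9096")
        .option("subscribe", "app-***-events")
        .option("startingOffsets", "earliest")
        .load())

    # Land the raw records in a Delta table (table name and checkpoint path
    # are placeholders); .toTable() needs Spark 3.1+ / a recent DBR.
    (kafka
        .selectExpr("CAST(key AS STRING) AS key",
                    "CAST(value AS STRING) AS value",
                    "topic", "partition", "offset", "timestamp")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/app_events")
        .toTable("raw.app_events"))

The TACL-enabled cluster can then read the result with spark.table("raw.app_events"), and analysts keep working within the table ACLs.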