Tags: apache-spark, pyspark, spark-structured-streaming, delta-lake

Spark Structured Streaming between different clusters: from a Delta table on cluster A to cluster B


I am trying to stream a Delta table from cluster A to cluster B, but I am not able to load from or write to a different cluster:

# Read the Delta table on cluster A as a stream
streamingDf = spark.readStream.format("delta").option("ignoreChanges", "true") \
              .load("hdfs://cluster_A/delta-table")

# Write the stream to a Delta sink on cluster B
stream = streamingDf.writeStream.format("delta").option("checkpointLocation", "/tmp/checkpoint") \
         .start("hdfs://cluster_B/delta-sink")

Then, I get the following error:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block

So, my question is whether it is possible to stream data directly between two clusters using the Delta format, or whether additional technologies are required to achieve this.

Thanks!


Solution

  • The error was caused by firewall rules: every node in cluster A must be able to reach every node in cluster B on the corresponding ports. I had only opened the ports on the NameNodes, but HDFS clients read and write blocks directly from the DataNodes, so those connections were blocked and the reads failed with BlockMissingException.
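To verify this kind of connectivity problem before touching Spark, you can probe the relevant ports directly from a node in cluster A. Below is a minimal sketch; the hostnames are hypothetical placeholders, and the ports shown are the common Hadoop defaults (8020 for NameNode RPC; 9866/9867 for DataNode data transfer and IPC on Hadoop 3, which were 50010/50020 on Hadoop 2) — your deployment may use different values:

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical cluster B hosts -- replace with your own DataNode/NameNode
# addresses. Every DataNode must be reachable, not just the NameNode.
checks = [
    ("namenode.cluster_B", 8020),   # NameNode RPC (default)
    ("datanode1.cluster_B", 9866),  # DataNode data transfer (Hadoop 3 default)
    ("datanode1.cluster_B", 9867),  # DataNode IPC (Hadoop 3 default)
]

for host, port in checks:
    status = "OK" if port_reachable(host, port) else "BLOCKED"
    print(f"{host}:{port} -> {status}")
```

Running this from each node of cluster A quickly shows which firewall rules are still missing: a `BLOCKED` DataNode port is exactly the situation that surfaces later as `BlockMissingException`.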