Tags: databricks, sftp, azure-databricks

Transfer JSON files from Databricks via SFTP


We need to transfer multiple JSON files from DBX (e.g. abfss://@.dfs.core.windows.net/folder1/json_files) via SFTP.

Is there any sample code/notebook with guidelines which we can follow for this task?

  • Added this jar as a cluster library - com.springml:spark-sftp_2.11:1.1.3
  • Cluster runtime - 13.1 (includes Apache Spark 3.4.0, Scala 2.12)

Tried the below code:

df.write.format("com.springml.spark.sftp") \
    .option("host", hostname) \
    .option("username", "user") \
    .option("password", "password") \
    .option("fileType", "json") \
    .save("/ftp/files/sample.json")

And getting this error:

  java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;

Solution

  • The NoSuchMethodError you are seeing happens because com.springml:spark-sftp_2.11:1.1.3 is built for Scala 2.11, while Databricks Runtime 13.1 ships with Scala 2.12, so that jar is binary-incompatible with your cluster. You can use the Spark File Transfer Library instead, which provides a Scala 2.12 build: com.github.arcizon:spark-filetransfer_2.12:0.3.0. First, install this library in your cluster.
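    If you want to confirm which Scala version your cluster runtime is built against, one quick check from a Python notebook cell is the sketch below. It relies on PySpark's internal _jvm gateway (not a public API), so treat it as a convenience rather than a guaranteed interface:

    # Print the Scala version bundled with the driver JVM, e.g. "version 2.12.15"
    print(spark.sparkContext._jvm.scala.util.Properties.versionString())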


    Using this library you can read and write dataframes in different file formats. Below is an example of reading a text file.

    df_txt = spark.read \
        .format("filetransfer") \
        .option("protocol", "sftp") \
        .option("host", host) \
        .option("port", "22") \
        .option("username", username) \
        .option("password", password) \
        .option("fileFormat", "text") \
        .load("/pub/example/readme.txt")
        
    display(df_txt)
    


    Similarly, you can write data to SFTP as below.

    df = spark.read.json(adls_path)
    
    df.write \
        .format("filetransfer") \
        .option("protocol", "sftp") \
        .option("host", host) \
        .option("port", "22") \
        .option("username", username) \
        .option("password", password) \
        .option("fileFormat", "json") \
        .save("data/upload/output/sample.json")
    
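    Note that save() writes the dataframe out as a single remote file, while the question asks about multiple JSON files. A minimal sketch for uploading each file in the ADLS folder separately could look like this (dbutils.fs.ls is the standard Databricks file utility; adls_path, host, username and password are the same placeholder variables as above, and the remote output folder is hypothetical):

    # List the files in the ADLS folder and upload each JSON file via SFTP,
    # keeping the original file name on the remote side.
    for f in dbutils.fs.ls(adls_path):
        if f.name.endswith(".json"):
            spark.read.json(f.path).write \
                .format("filetransfer") \
                .option("protocol", "sftp") \
                .option("host", host) \
                .option("port", "22") \
                .option("username", username) \
                .option("password", password) \
                .option("fileFormat", "json") \
                .save(f"data/upload/output/{f.name}")

    Rather than hard-coding the password, you could also read it from a secret scope, e.g. dbutils.secrets.get(scope="my-scope", key="sftp-password") (scope and key names here are placeholders).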

    For more information, refer to the spark-filetransfer GitHub repo.