Tags: python, apache-spark, amazon-s3, pyspark, databricks

PySpark and Databricks addFile and SparkFiles.get Exception java.io.FileNotFoundException


I am trying to:

  1. Load an SSL certificate from S3 onto the cluster.
  2. Call addFile so all nodes can see the file.
  3. Create a connection URL to IBM Db2 with JDBC.

Steps 1 and 2 work successfully: on the driver I can open the file with open(cert_filepath, "r") and print its contents.
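For reference, the driver-side check that succeeds looks like this:

# This works on the driver: the file exists at the driver-local SparkFiles path
with open(cert_filepath, "r") as file:
    print(file.read())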

But in Step 3 I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.239.124.48 executor 1): com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][2043][11550][4.27.25] Exception java.io.FileNotFoundException: Error opening socket to server / on port 53,101 with message: /local_disk0/spark-c30d1a7f-deea-4184-a9f9-2b6c9eab6c5e/userFiles-761620cf-2cb1-4623-9677-68694f0e4b3c/dwt01_db2_ssl.arm (No such file or directory). ERRORCODE=-4499, SQLSTATE=08001

The port is actually 53101; the error message just formats it with a thousands separator.

The essential part of code:

import boto3
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()

# Step 1: download the certificate from S3 to the driver's working directory
s3_client = boto3.client("s3")
s3_client.download_file(
    Bucket="my-bucket-s3",
    Key="my/path/db2/dwt01_db2_ssl.arm",
    Filename="dwt01_db2_ssl.arm",
)

# Step 2: distribute the file to all nodes and resolve its local path
sc.addFile("dwt01_db2_ssl.arm")
cert_filepath = SparkFiles.get("dwt01_db2_ssl.arm")

# Step 3: read credentials and build the JDBC URL for IBM Db2 over SSL
user_name = cls.get_aws_secret(secret_name=cls._db2_username_aws_secret_name, key="username", region="eu-central-1")
password = cls.get_aws_secret(secret_name=cls._db2_password_aws_secret_name, key="password", region="eu-central-1")
driver = "com.ibm.db2.jcc.DB2Driver"
jdbc_url = f"jdbc:db2://{hostname}:{port}/{database}:sslConnection=true;sslCertLocation={cert_filepath};"

df = (
    spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", jdbc_url)
    .option("dbtable", f"({query}) as src")
    .option("user", user_name)
    .option("password", password)
    .load()
)

I can't figure out what causes this FileNotFoundException, since the file is clearly there when I read and print it.


Solution

  • You're getting the java.io.FileNotFoundException for the reason described in this comment:

    In cluster mode, a local file that has not been added to the spark-submit will not be found via addFile. This is because the driver (application master) is started on the cluster and is already running by the time it reaches the addFile call. At that point it is too late: the application has already been submitted, and the local file system is the file system of a specific cluster node.
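    You can see the underlying problem by comparing the SparkFiles root directory on the driver with the one each executor resolves: the absolute paths differ per node, so the driver-resolved path baked into your JDBC URL does not exist on the workers. A minimal diagnostic sketch, reusing your sc:

    from pyspark import SparkFiles

    # Driver-local userFiles directory (this is what ends up in your JDBC URL)
    print(SparkFiles.getRootDirectory())
    # Each executor resolves its own, different userFiles directory
    print(sc.parallelize(range(2), 2).map(lambda _: SparkFiles.getRootDirectory()).collect())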

    So, to overcome this issue, I suggest using a DBFS path (e.g. /dbfs/dwt01_db2_ssl.arm) so the file is consistently accessible from every worker.

    Here's your updated code:

    # sc = SparkContext.getOrCreate()  # no longer needed just for file distribution
    s3_client = boto3.client("s3")
    # Download the certificate directly onto DBFS so every node sees the same path
    s3_client.download_file(
        Bucket="my-bucket-s3",
        Key="my/path/db2/dwt01_db2_ssl.arm",
        Filename="/dbfs/dwt01_db2_ssl.arm",
    )
    # sc.addFile("dwt01_db2_ssl.arm")                      # no longer needed
    # cert_filepath = SparkFiles.get("dwt01_db2_ssl.arm")  # no longer needed
    cert_filepath = "/dbfs/dwt01_db2_ssl.arm"
    

    Afterward, you'll have to configure the Spark SSL properties as follows, because your code is currently missing this:

    spark.conf.set("spark.ssl", "true")
    spark.conf.set("spark.ssl.trustStore", cert_filepath)
    

    And then the rest of the code will be as follows:

    user_name = cls.get_aws_secret(secret_name=cls._db2_username_aws_secret_name, key="username", region="eu-central-1")
    password = cls.get_aws_secret(secret_name=cls._db2_password_aws_secret_name, key="password", region="eu-central-1")
    driver = "com.ibm.db2.jcc.DB2Driver"
    jdbc_url = f"jdbc:db2://{hostname}:{port}/{database}:sslConnection=true;"
    
    df = (
        spark.read.format("jdbc")
        .option("driver", driver)
        .option("url", jdbc_url)
        .option("dbtable", f"({query}) as src")
        .option("user", user_name)
        .option("password", password)
        .load()
    )
    

    Note: I have removed sslCertLocation from jdbc_url because it shouldn't be needed anymore, as the Spark SSL properties above are configured to handle that.
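    That said, if the Spark SSL properties don't take effect in your environment, a fallback sketch is to keep sslCertLocation in the JDBC URL and point it at the shared DBFS path, so the DB2 JCC driver loads the certificate itself:

    # Fallback: keep sslCertLocation, but with the DBFS path that all nodes can read
    jdbc_url = (
        f"jdbc:db2://{hostname}:{port}/{database}"
        f":sslConnection=true;sslCertLocation={cert_filepath};"
    )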

    Making the above changes should resolve the error.

    I suspect another issue might come up because of the .arm file, as it's not a standard truststore format (e.g. JKS or PKCS12). But let me know if you run into any problems.
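    If the .arm file does turn out to be a problem, one possible workaround (a sketch, not something from your current setup) is to import the certificate into a standard JKS truststore with keytool and point the driver at that instead; the alias, truststore path, and password below are placeholders:

    import subprocess

    # Import the .arm certificate into a JKS truststore (alias/path/password are placeholders)
    subprocess.run(
        [
            "keytool", "-importcert", "-noprompt",
            "-alias", "db2-ssl",
            "-file", "/dbfs/dwt01_db2_ssl.arm",
            "-keystore", "/dbfs/db2_truststore.jks",
            "-storepass", "changeit",
        ],
        check=True,
    )
    # The driver can then be pointed at the truststore through JDBC URL properties,
    # e.g. ;sslTrustStoreLocation=/dbfs/db2_truststore.jks;sslTrustStorePassword=changeit;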