google-cloud-storage google-hadoop

Deleted google storage directory appears "already exists" when calling Spark DataFrame.saveAsParquetFile()


I deleted a Google Cloud Storage directory through the Google Cloud Console. The directory had been generated by an earlier Spark (ver 1.3.1) job. Now, whenever I re-run the job, it fails, and the job seems to think the directory is still there, even though I cannot find the directory with gsutil.

Is this a bug, or did I miss something? Thanks!

The error I got:

java.lang.RuntimeException: path gs://<my_bucket>/job_dir1/output_1.parquet already exists.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:112)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:995)
at com.xxx.Job1$.execute(Job1.scala:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Solution

  • It appears you might be running into a known bug with the NFS list-consistency cache: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/5

    It was fixed in the latest release, so deploying a new cluster with bdutil-1.3.1 (announced here: https://groups.google.com/forum/#!topic/gcp-hadoop-announce/vstNuV0LpDc) should resolve the problem. If you need to upgrade in place, you can try downloading the latest gcs-connector-1.4.1 jarfile onto your master and worker nodes under /home/hadoop/hadoop-install/lib/gcs-connector-*.jar and then restarting the Spark daemons:

    sudo sudo -u hadoop /home/hadoop/spark-install/sbin/stop-all.sh
    sudo sudo -u hadoop /home/hadoop/spark-install/sbin/start-all.sh
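
To check whether a stale listing is really what the job is seeing, it may help to compare what gsutil and the Hadoop filesystem layer report for the same path. The following is only a sketch: `check_path` is a hypothetical helper name, and it assumes both `gsutil` and the `hadoop` CLI are on the PATH of a cluster node where `gs://` paths are served by the GCS connector.

```shell
# Hypothetical helper: report whether a gs:// path is visible to gsutil
# and to the Hadoop filesystem layer.
# Assumes `gsutil` and `hadoop` are both available on this node.
check_path() {
  local path="$1"
  local gs_seen=no hadoop_seen=no

  # gsutil ls exits non-zero when the URL matches no objects.
  if gsutil ls "$path" >/dev/null 2>&1; then
    gs_seen=yes
  fi

  # hadoop fs -test -e exits 0 only if the path exists according to Hadoop.
  if hadoop fs -test -e "$path" >/dev/null 2>&1; then
    hadoop_seen=yes
  fi

  echo "gsutil=$gs_seen hadoop=$hadoop_seen"
}

# Example call (path taken from the error message above):
# check_path "gs://<my_bucket>/job_dir1/output_1.parquet"
```

If this prints `gsutil=no hadoop=yes`, the Hadoop side still believes the deleted directory exists, which would be consistent with the cache issue described above. If both report `no`, the cause is probably elsewhere.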