Tags: apache-spark, google-cloud-dataproc

Reading CSV file with Spark runs sometimes forever


I'm using Spark 2.4.8 with the gcs-connector from com.google.cloud.bigdataoss, version hadoop2-2.1.8. For development I'm using a Compute Engine VM with my IDE. I try to consume some CSV files from a GCS bucket natively with Spark's .csv(...).load(...) functionality. Some files load successfully, but some do not. For those, I can see in the Spark UI that the load job runs forever, until a timeout fires.
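For reference, the read in question looks roughly like this (a minimal sketch; the bucket path and reader options are placeholders, not taken from the original post):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the problematic read. Assumes the gcs-connector JAR is on
// the classpath so that the gs:// filesystem scheme resolves.
val spark = SparkSession.builder()
  .appName("gcs-csv-read")
  .getOrCreate()

val df = spark.read
  .format("csv")
  .option("header", "true")              // hypothetical: first line is a header
  .option("inferSchema", "true")
  .load("gs://my-bucket/path/to/*.csv")  // hypothetical bucket path

df.show(10)
```

Running this requires a Spark runtime and access to a real GCS bucket, so it is a sketch of the shape of the code rather than something runnable in isolation.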

The weird thing is that when I run the same application, packaged as a fat JAR, on a Dataproc cluster, all of the same files can be consumed successfully.

What am I doing wrong?


Solution

  • @JanOels, as you mentioned in the comment, using the gcs-connector in version hadoop2-2.2.8 will resolve this issue. The latest version on the hadoop2 line is hadoop2-2.2.10.

    For more information about all hadoop2 versions of the gcs-connector from com.google.cloud.bigdataoss, this document can be referred to.
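    Concretely, the fix amounts to pinning the connector dependency to the newer release. In an sbt build this could look like the following (a sketch; the Maven coordinates assume the hadoop2 line of the connector, and the version can be bumped to hadoop2-2.2.10 if desired):

    ```scala
    // build.sbt fragment: pin the GCS connector to the release that
    // resolves the hanging-read issue (instead of hadoop2-2.1.8).
    libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.2.8"
    ```

    The same pin can be expressed as a Maven `<dependency>` entry if the project builds with Maven.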

    Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.

    Feel free to edit this answer for additional information.