apache-spark, ftp, google-cloud-dataproc

Can't read files via FTP using SparkContext.textFile(...) on Google Dataproc


I am running a Spark cluster on Google Dataproc and I'm experiencing issues while trying to read a GZipped file from FTP using sparkContext.textFile(...).

The code I am running is:

import org.apache.spark.{SparkConf, SparkContext}

object SparkFtpTest extends App {
  // SparkContext creation included for completeness
  val sc = new SparkContext(new SparkConf().setAppName("SparkFtpTest"))
  val file = "ftp://username:password@host:21/filename.txt.gz"
  val lines = sc.textFile(file)
  lines.saveAsTextFile("gs://my-bucket-storage/tmp123")
}

The error that I get is:

Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.

I've seen suggestions that this error means the credentials are wrong, so I tried deliberately entering wrong credentials; that produced a different error: Invalid login credentials.

It also works if I paste the URL into a browser: the file downloads properly.

It's also worth mentioning that I've tried using the Apache commons-net library directly (the same version Spark uses, 2.2), and it worked: I was able to stream the data from both the master and worker nodes. I wasn't able to decompress it, though (using Java's GZIPInputStream; I can't remember the exact failure, but I can try to reproduce it if you think it's important). I think this suggests it's not some firewall issue on the cluster, though I wasn't able to use curl to download the file.
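
For reference, this is roughly what that commons-net test looked like, reconstructed from memory; host, port, credentials, and the filename are the same placeholders as in the code above:

import java.util.zip.GZIPInputStream
import org.apache.commons.net.ftp.{FTP, FTPClient}
import scala.io.Source

val ftp = new FTPClient()
ftp.connect("host", 21)
ftp.login("username", "password")
ftp.enterLocalPassiveMode()            // passive mode tends to be required from cloud VMs
ftp.setFileType(FTP.BINARY_FILE_TYPE)  // FTP defaults to ASCII mode, which corrupts gzip data
val in = ftp.retrieveFileStream("filename.txt.gz")
// Streaming the raw bytes worked; decompressing the stream is where it failed:
Source.fromInputStream(new GZIPInputStream(in)).getLines().foreach(println)
ftp.completePendingCommand()
ftp.logout()
ftp.disconnect()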

I think I ran the same code from my local machine a few months ago and, if I remember correctly, it worked just fine.

Do you have any idea what is causing this problem? Could it be some kind of dependency conflict, and if so, which one?

I have a couple of dependencies in the project, such as google-sdk, solrj, ... However, I'd expect to see something like a ClassNotFoundException or NoSuchMethodError if it were a dependency problem.

The whole stack trace looks like this:

16/12/05 23:53:46 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
16/12/05 23:53:49 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/0/
16/12/05 23:53:50 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/0/ - removing from cache
16/12/05 23:53:50 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:51 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
    at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:298)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:495)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:537)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:586)
    at org.apache.commons.net.ftp.FTP.quit(FTP.java:794)
    at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:788)
    at org.apache.hadoop.fs.ftp.FTPFileSystem.disconnect(FTPFileSystem.java:151)
    at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:395)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1219)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1064)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1030)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:956)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:955)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1459)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1438)

Solution

  • It looks like this may be a known unresolved issue in Spark/Hadoop: https://issues.apache.org/jira/browse/HADOOP-11886 and https://github.com/databricks/learning-spark/issues/21 both allude to a similar stack trace.

    Since you were able to use the Apache commons-net library manually, you can achieve the same effect as sc.textFile by obtaining a list of the files, parallelizing that list into an RDD, and using flatMap so that each task takes a filename and reads that file line by line, producing the collection of lines for each file; see the first sketch after this answer.

    Alternatively, if the amount of data you have on the FTP server is small (up to maybe 10 GB or so), parallel reads won't help much compared to a single thread copying from your FTP server onto HDFS or GCS in your Dataproc cluster, and then processing that HDFS or GCS path in your Spark job; the second sketch below shows this.
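
    A minimal sketch of the flatMap approach, reusing the placeholder host and credentials from the question; the fileNames listing here is a hypothetical input (it could come from, e.g., FTPClient.listNames):

    import java.util.zip.GZIPInputStream
    import org.apache.commons.net.ftp.{FTP, FTPClient}
    import scala.io.Source

    val fileNames = Seq("filename.txt.gz")  // hypothetical listing
    val lines = sc.parallelize(fileNames).flatMap { name =>
      val ftp = new FTPClient()             // one connection per task, opened on the executor
      ftp.connect("host", 21)
      ftp.login("username", "password")
      ftp.enterLocalPassiveMode()
      ftp.setFileType(FTP.BINARY_FILE_TYPE)
      val in = ftp.retrieveFileStream(name)
      // Materialize the lines before closing the connection, since the stream dies with it
      val result = Source.fromInputStream(new GZIPInputStream(in)).getLines().toList
      ftp.completePendingCommand()
      ftp.logout()
      ftp.disconnect()
      result
    }
    lines.saveAsTextFile("gs://my-bucket-storage/tmp123")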
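
    And a sketch of the second approach, a single-threaded copy from FTP onto GCS via the Hadoop FileSystem API before running the Spark job; the staging path is a placeholder:

    import org.apache.commons.net.ftp.{FTP, FTPClient}
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.IOUtils

    val ftp = new FTPClient()
    ftp.connect("host", 21)
    ftp.login("username", "password")
    ftp.enterLocalPassiveMode()
    ftp.setFileType(FTP.BINARY_FILE_TYPE)
    val in = ftp.retrieveFileStream("filename.txt.gz")

    val dst = new Path("gs://my-bucket-storage/staging/filename.txt.gz")
    val out = dst.getFileSystem(sc.hadoopConfiguration).create(dst)
    IOUtils.copyBytes(in, out, 4096, true)  // copies, then closes both streams
    ftp.completePendingCommand()
    ftp.disconnect()

    // Spark decompresses .gz transparently when reading from the GCS path:
    val lines = sc.textFile("gs://my-bucket-storage/staging/filename.txt.gz")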