apache-spark, pyspark

Are downloads from the Spark distribution archive often slow?


I was trying to download the Spark-Hadoop distribution from https://archive.apache.org/dist/spark/spark-3.1.2/. I often find that downloads from this site are slow. Is this due to some general issue with the site itself?

I have verified that the download is slow in two ways:

  • In Colab, I ran the command !wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz. It often keeps running for more than 10 minutes, while at other times it finishes within 1 minute.
  • I also tried downloading the file directly from the website, and even then the download speed is occasionally extremely slow.
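To compare slow and fast runs with numbers rather than impressions, the download can be timed from Python. This is only a minimal sketch using the standard library: the helper names `timed_download` and `throughput_mib_per_s` and the `opener` parameter are made up for illustration, and the URL is the one from the wget command above.

```python
import time
import urllib.request

# Same file as in the wget command above.
URL = "https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz"


def throughput_mib_per_s(num_bytes, seconds):
    """Average throughput in MiB/s; returns 0.0 for a zero-length interval."""
    if seconds <= 0:
        return 0.0
    return num_bytes / (1024 * 1024) / seconds


def timed_download(url, dest, chunk_size=1024 * 1024, opener=urllib.request.urlopen):
    """Stream url into the file dest and return (bytes_written, elapsed_seconds).

    opener is a hypothetical hook so a different client can be swapped in;
    by default it is urllib.request.urlopen.
    """
    start = time.monotonic()
    total = 0
    with opener(url) as response, open(dest, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes means the stream is finished
                break
            out.write(chunk)
            total += len(chunk)
    return total, time.monotonic() - start
```

Calling `size, elapsed = timed_download(URL, "spark-3.1.2-bin-hadoop3.2.tgz")` a few times at different hours and printing `throughput_mib_per_s(size, elapsed)` should show whether the slowdown is constant or intermittent.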

Solution

  • It may be because:

    • You download the file multiple times.
    • You download with a non-browser client, for example curl or wget.
    • You are physically far from the file server, or your network is unstable.
    • Something else, for example the file server itself is slow.

    I think most public servers have some kind of safeguard against DDoS attacks, and that safeguard limits download traffic per second. I faced a similar issue: the download took 1 minute from a browser, but 10 minutes with curl.
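If the cause is an unstable network or a server that throttles or drops long transfers, one way to cope is to retry and resume instead of starting over. The sketch below is an illustration under assumptions, not a confirmed fix: it assumes the server honours HTTP Range requests (it falls back to a full download if not), and the names `range_header` and `download_with_resume` as well as the retry counts and timeouts are arbitrary choices.

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def range_header(bytes_already_saved):
    """Value of the HTTP Range header that asks for the rest of the file."""
    return f"bytes={bytes_already_saved}-"


def download_with_resume(url, dest, max_attempts=5, wait_seconds=10,
                         chunk_size=1024 * 1024):
    """Download url to dest, continuing from the partial file after a failure."""
    for attempt in range(1, max_attempts + 1):
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        request = urllib.request.Request(url)
        if offset:
            request.add_header("Range", range_header(offset))
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # 206 Partial Content: the server accepted the Range request,
                # so append. Any other success status means a full body,
                # so overwrite the partial file.
                mode = "ab" if response.status == 206 else "wb"
                with open(dest, mode) as out:
                    while True:
                        chunk = response.read(chunk_size)
                        if not chunk:
                            break
                        out.write(chunk)
            return dest
        except urllib.error.HTTPError as exc:
            # 416 Range Not Satisfiable: nothing left to fetch, so the
            # existing file is assumed to be complete.
            if exc.code == 416 and offset:
                return dest
            print(f"attempt {attempt} failed: {exc}")
        except (OSError, http.client.HTTPException) as exc:
            # Covers timeouts, connection resets and truncated responses.
            print(f"attempt {attempt} failed: {exc}")
        time.sleep(wait_seconds)
    raise RuntimeError(f"download failed after {max_attempts} attempts")
```

From the command line, `wget -c` (continue) combined with `--tries` gives similar behaviour without any Python.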