apache-spark, hadoop, pyspark, google-cloud-storage, google-cloud-dataproc

Storing source files in Google Dataproc HDFS vs Google Cloud Storage (Google bucket)


I want to process ~500 GB of data spread across 64 JSON files, each containing 5M records. Essentially, I need to run a PySpark map function over each of the ~300M records.

To test my PySpark map function, I have set up a Google Dataproc cluster (1 master, 5 workers) to test with just one JSON file.

What is the best practice here?

Should I copy all the files to the master node (to make use of the Hadoop distributed file system in Dataproc), or will it be equally efficient to keep the files in my GCS bucket and point to their location in my PySpark code?

Also, my code imports quite a few external modules, which I have copied to my master node, and the imports work fine there. What is the best practice for copying them to all the other worker nodes so that when PySpark runs on those workers I don't get import errors?

I read a few articles on the Google Cloud website but did not get a clear answer on where to store the files.

I can manually copy the external modules to each of my worker nodes, but I can't do that in production, when I will be dealing with at least 100 nodes.


Solution

  • You're asking several questions, so let's take them one at a time.

    1. My code imports quite a few external modules, which I have copied to my master node, and the imports work fine there. What is the best practice for copying them to all the other worker nodes so that when PySpark runs on those workers I don't get import errors?

    If the modules are external (e.g., you install them via pip install), then I'd use an initialization action, so they get installed on every node when the cluster is created.

    If what you have is a lot of .py files that you wrote, I'd put them in an archive file and pass it to your job with the --py-files argument (see the sketch below). I would also suggest looking into building wheels or eggs.

    You may find this link useful: https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f
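
    As a rough illustration of the --py-files route, here is a minimal driver sketch; the archive name deps.zip, the gs:// path, and the helper module mymodule are hypothetical placeholders, not part of your setup:

    ```python
    # main.py -- submitted e.g. with:  spark-submit --py-files deps.zip main.py
    # (deps.zip is a hypothetical archive containing your own .py modules)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-map").getOrCreate()
    sc = spark.sparkContext

    # Alternative to --py-files: ship the archive at runtime. The path below is a
    # hypothetical example; local, HDFS, and (via the GCS connector) gs:// paths work.
    # sc.addPyFile("gs://my-bucket/deps/deps.zip")

    # Modules inside the archive become importable on the driver and on every executor.
    from mymodule import transform_record  # hypothetical helper bundled in deps.zip
    ```

    Either way, the archive only has to live in one place (your bucket or the machine you submit from); Spark distributes it to the workers for you, so there is nothing to copy node by node.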

    2. Should I copy all the files to the master node (to make use of the Hadoop distributed file system in Dataproc), or will it be equally efficient if I keep the files in my GCS bucket?

    If the data is already in GCS and you intend to store it there, there is no added benefit to copying it down to the master node. The GCS connector can read it in place (and in parallel!) from GCS, and this may be cheaper (in terms of compute cost) than copying to/from GCS separately.

    It sounds like your data is already decently sharded; this is a good reason to just read it from GCS directly in Spark (see the sketch after the quoted bullets below).

    The GCS connector page calls this out explicitly:

    • Direct data access – Store your data in Cloud Storage and access it directly, with no need to transfer it into HDFS first.

    • HDFS compatibility – You can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://.

    • Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.

    • No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance such as checking the file system, upgrading or rolling back to a previous version of the file system, etc.

    • Quick startup – In HDFS, a MapReduce job can't start until the NameNode is out of safe mode—a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.
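
    To make the "read it in place" point concrete, here is a minimal sketch of pointing Spark at the bucket directly; the bucket paths and the per-record function are placeholders for your own:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-json-map").getOrCreate()

    # Hypothetical bucket/path: on Dataproc the GCS connector resolves gs:// URIs
    # just like hdfs:// ones, so the 64 JSON shards are read in parallel, in place.
    records = spark.read.json("gs://my-bucket/input/*.json")

    # Placeholder per-record map -- substitute your own transformation here.
    mapped = records.rdd.map(lambda row: row.asDict())

    # Write results back to GCS as well, rather than to cluster-local HDFS.
    mapped.saveAsTextFile("gs://my-bucket/output/")
    ```

    Because the input is already split into 64 files, Spark gets a reasonable initial partitioning for free; and since the output also goes to gs://, nothing important lives on the cluster's HDFS, so you can delete the cluster as soon as the job finishes.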