Tags: hadoop, hdfs, google-cloud-storage, google-cloud-dataproc, google-hadoop

How to speed up distcp when transferring data from Hadoop to Google Cloud Storage


Google Cloud provides a connector for working with Hadoop (https://cloud.google.com/hadoop/google-cloud-storage-connector).

Using the connector, I copy data from HDFS to Google Cloud Storage.

For example:

hadoop distcp hdfs://${path} gs://${path}

But the data set is very large (16 TB) and the transfer speed is only 2 MB/s.
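
For scale, here is a rough back-of-the-envelope estimate of how long the copy would take at that rate. It is only a sketch: it assumes 16 TB means 16 TiB and that the observed rate is 2 MiB per second (megabytes, not megabits), and `estimate_days` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough transfer-time estimate using integer arithmetic.
# Assumptions: size is given in TiB, rate in MiB per second.
estimate_days() {
  local size_tib=$1
  local rate_mib_s=$2
  local size_mib=$(( size_tib * 1024 * 1024 ))   # TiB -> MiB
  local seconds=$(( size_mib / rate_mib_s ))     # total transfer time in seconds
  echo $(( seconds / 86400 ))                    # whole days, rounded down
}

estimate_days 16 2   # prints 97
```

Under those assumptions the transfer would take roughly three months, which is why the current speed is not workable.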

So I tried changing the distcp settings (the map property, the bandwidth property, ...).

However, the speed stayed the same.
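
For reference, the kind of tuning described above might look like the sketch below. The `-m` option (maximum number of simultaneous copy tasks) and the `-bandwidth` option (per-map limit in MB/s) are standard DistCp options; the specific values, the paths, and the `build_distcp_cmd` helper are hypothetical and only illustrate the shape of the command.

```shell
#!/usr/bin/env bash
# Hypothetical values; tune them for your own cluster and network link.
NUM_MAPS=50                        # -m: maximum number of simultaneous copy tasks
BANDWIDTH_MB=100                   # -bandwidth: per-map limit in MB/s
SRC="hdfs:///data/source"          # hypothetical source path
DST="gs://example-bucket/target"   # hypothetical destination bucket and path

# Assemble the DistCp command line from the settings above.
build_distcp_cmd() {
  echo "hadoop distcp -m ${NUM_MAPS} -bandwidth ${BANDWIDTH_MB} ${SRC} ${DST}"
}

# Print the command instead of running it, so it can be reviewed first.
build_distcp_cmd
```

These options only cap how fast DistCp is allowed to go (roughly the number of maps times the per-map bandwidth). If throughput stays near 2 MB/s whatever the values are, the limit is probably outside DistCp, for example in the network link between the cluster and Google Cloud.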

How can I speed up distcp when transferring data from HDFS to Google Cloud Storage?


Solution

  • The official documentation states that one of the best options for transferring data from on-premises clusters to GCP is a VPN tunnel over the internet, or multiple VPN tunnels for additional bandwidth.

    Other options proposed are using direct peering between Google's edge points of presence (PoPs) and your network, or establishing a direct connection to Google's network with the help of a Cloud Interconnect service provider.