
Key file distribution in Hadoop cluster


I want to send a lot of files from HDFS to Google Storage (GS), so I want to use the distcp command in this case.

hadoop distcp -libjars <full path to connector jar> -m <number of mappers> hdfs://<host>:<port(default 8020)>/<hdfs path> gs://<bucket name>/

I also need to specify the *.p12 key file in core-site.xml to access GS, and I need to distribute this file to all nodes in my cluster.

<property>
    <name>google.cloud.auth.service.account.keyfile</name>
    <value>/opt/hadoop/conf/gcskey.p12</value>
</property>

I do not want to do it manually. What is the best practice for distributing the key file?


Solution

  • There is a generic parameter

    -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
    

    The command will be:

    hadoop distcp -libjars <full path to connector jar> -files /etc/hadoop/conf/gcskey.p12 -m <number of mappers> hdfs://<host>:<port(default 8020)>/<hdfs path> gs://<bucket name>/
    

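    NOTE1: In this case the key path (google.cloud.auth.service.account.keyfile) in core-site.xml must be set to the bare file name, as in the last property of the example below, because -files copies the file into each task's working directory.

    NOTE2: You need to have the .p12 key file in the current directory when you run the command, because Hadoop checks the paths from core-site.xml on start.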

    <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
        <description>
            The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
        </description>
    </property>
    <property>
        <name>fs.gs.project.id</name>
        <value>google project id</value>
        <description>Google Project Id</description>
    </property>
    <property>
        <name>google.cloud.auth.service.account.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>google.cloud.auth.service.account.email</name>
        <value>google service account email</value>
        <description>Project service account email</description>
    </property>
    <property>
        <name>google.cloud.auth.service.account.keyfile</name>
        <value>gcskey.p12</value>
        <description>Local path to .p12 key at each node</description>
    </property>
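
Putting the command and the two notes together, the launch step could be wrapped roughly like this. This is only a sketch under assumptions: the variable names, the RUN switch, the build_distcp_cmd helper, and all default values (connector jar path, NameNode host, bucket name, mapper count) are hypothetical placeholders, not something the connector or Hadoop provides. By default it only prints the command; it runs distcp only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: every default below is a hypothetical placeholder.
set -eu

# Print the distcp command line for the given arguments.
# $1 = key file, $2 = connector jar, $3 = number of mappers,
# $4 = HDFS source, $5 = GCS destination
build_distcp_cmd() {
    printf 'hadoop distcp -libjars %s -files %s -m %s %s %s\n' \
        "$2" "$1" "$3" "$4" "$5"
}

KEY_SRC="${KEY_SRC:-/etc/hadoop/conf/gcskey.p12}"
CONNECTOR_JAR="${CONNECTOR_JAR:-/path/to/gcs-connector.jar}"
MAPPERS="${MAPPERS:-10}"
SRC="${SRC:-hdfs://namenode:8020/data}"
DST="${DST:-gs://my-bucket/}"

if [ "${RUN:-0}" = "1" ]; then
    # core-site.xml refers to the key by its bare file name, so keep a copy
    # in the directory the command is launched from (see NOTE2).
    key_name=$(basename "$KEY_SRC")
    [ -e "./$key_name" ] || cp "$KEY_SRC" "./$key_name"
    # Generic options (-libjars, -files) are placed before the distcp
    # options, as in the command shown in the answer.
    hadoop distcp -libjars "$CONNECTOR_JAR" -files "$KEY_SRC" \
        -m "$MAPPERS" "$SRC" "$DST"
else
    # Default: dry run, only show what would be executed.
    build_distcp_cmd "$KEY_SRC" "$CONNECTOR_JAR" "$MAPPERS" "$SRC" "$DST"
fi
```

Running it without RUN=1 first lets you check that the file name passed to -files matches the value of google.cloud.auth.service.account.keyfile before any data is copied.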