hadoop, hdfs, hadoop3, erasure-code

How do I configure erasure coding in Hadoop 3, and is it used only for storing cold files by default?


According to the Hadoop 3.x release notes, erasure coding was introduced to reduce the storage overhead of replication:

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.

Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.
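For reference, the overhead figures in the quote follow from (data units + parity units) / data units. A minimal shell sketch of that arithmetic is below; the `ec_overhead` helper is hypothetical and not part of Hadoop, and the three parameter pairs are just the quoted Reed-Solomon (10,4) scheme plus the schemes behind the built-in RS-6-3-1024k and XOR-2-1-1024k policies.

```shell
#!/bin/sh
# Raw-storage overhead of an erasure code with k data units and m parity units.
# Hypothetical helper for illustration only; not a Hadoop command.
ec_overhead() {
    # $1 = number of data units (k), $2 = number of parity units (m)
    awk -v k="$1" -v m="$2" 'BEGIN { printf "%.2f\n", (k + m) / k }'
}

ec_overhead 10 4   # Reed-Solomon (10,4): prints 1.40
ec_overhead 6 3    # scheme used by RS-6-3-1024k: prints 1.50
ec_overhead 2 1    # scheme used by XOR-2-1-1024k: prints 1.50
```

By the same measure, keeping three full replicas costs 3.00.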

I am looking for sample configuration files for this feature.

Also, after setting up an EC policy and enabling it with hdfs ec -enablePolicy, does the policy apply only to cold files, or does it apply to all files in HDFS by default?


Solution

  • In Hadoop 3, an erasure coding policy can be enabled and then set on any directory in HDFS.

    Command to list the supported erasure coding policies:

    ./bin/hdfs ec -listPolicies

    Command to enable the XOR-2-1-1024k erasure coding policy:

    ./bin/hdfs ec -enablePolicy -policy XOR-2-1-1024k

    Command to set an erasure coding policy on an HDFS directory:

    ./bin/hdfs ec -setPolicy -path /tmp -policy XOR-2-1-1024k

    Command to get the policy set on a given directory:

    ./bin/hdfs ec -getPolicy -path /tmp

    Command to remove the policy from the directory, i.e. unset the policy:

    ./bin/hdfs ec -unsetPolicy -path /tmp

    Command to disable the policy:

    ./bin/hdfs ec -disablePolicy -policy XOR-2-1-1024k

    Edit:

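    A sample EC policy XML file named user_ec_policies.xml.template is available for reference in the Hadoop conf directory ($HADOOP_HOME/etc/hadoop/). Policies defined in a file of that form can be added with hdfs ec -addPolicies -policyFile <file>.

    The REPLICATION policy is always enabled. Erasure coding policies must be enabled before they can be set on a path; hdfs ec -listPolicies shows the current state of each policy.

    Erasure coding is not restricted to cold files: it applies to whatever path the policy is set on, and only to that path. For example, if you set a policy on /erasure_code_data, EC applies only to that directory, while other paths already present in HDFS, such as /tmp and /user, keep the REPLICATION policy. Setting a policy affects only files created afterwards; files already in the directory keep their existing layout.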
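Putting the commands above together, a rough end-to-end sketch for one directory might look like the script below. It assumes hdfs is on the PATH (the commands above use ./bin/hdfs from the Hadoop home instead). The POLICY, EC_DIR and DRY_RUN variables and the run and apply_ec_policy helpers are names made up for this example. By default it only prints the commands; set DRY_RUN=0 to execute them against a real cluster.

```shell
#!/bin/sh
# Sketch: enable an EC policy and apply it to a single HDFS directory.
# POLICY, EC_DIR and DRY_RUN are illustrative names, not Hadoop settings.
POLICY="${POLICY:-XOR-2-1-1024k}"
EC_DIR="${EC_DIR:-/erasure_code_data}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_ec_policy() {
    # A policy has to be enabled before it can be set on a path.
    run hdfs ec -enablePolicy -policy "$POLICY"
    # Make sure the target directory exists.
    run hdfs dfs -mkdir -p "$EC_DIR"
    # Files created under EC_DIR from now on use the EC policy.
    run hdfs ec -setPolicy -path "$EC_DIR" -policy "$POLICY"
    # Confirm which policy is in effect on the directory.
    run hdfs ec -getPolicy -path "$EC_DIR"
}

apply_ec_policy
```

Keeping cold data under its own directory such as /erasure_code_data is one way to limit EC to that data, since everything outside the directory stays replicated.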