python-3.x hbase thrift happybase hdp

Unable to upload PDF files larger than 10 MB to HBase via Python happybase - HDP 3


We are using HDP 3 and are trying to insert PDF files into a column of a particular column family in an HBase table. The development environment is Python 3.6, and the HBase connector is happybase 1.1.0.

We are unable to upload any PDF file larger than 10 MB to HBase.

In HBase we have set the parameters as follows: [two configuration screenshots, not reproduced]

We get the following error:

IOError(message=b'org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Cell with size 80941994 exceeds limit of 10485760 bytes\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.checkCellSizeLimit(RSRpcServices.java:937)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1010)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:959)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:922)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2683)\n\tat org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42014)\n\tat org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)\n\tat org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)\n\tat org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)\n\tat


Solution

  • You have to check the HBase source code to see what is happening:

    private void checkCellSizeLimit(final HRegion r, final Mutation m) throws IOException {
      if (r.maxCellSize > 0) {
        CellScanner cells = m.cellScanner();
        while (cells.advance()) {
          int size = PrivateCellUtil.estimatedSerializedSizeOf(cells.current());
          if (size > r.maxCellSize) {
            String msg = "Cell with size " + size + " exceeds limit of " + r.maxCellSize + " bytes";
            if (LOG.isDebugEnabled()) {
              LOG.debug(msg);
            }
            throw new DoNotRetryIOException(msg);
          }
        }
      }
    }
    

    Based on the error message, you are exceeding r.maxCellSize.

    Note on the above: the function PrivateCellUtil.estimatedSerializedSizeOf is deprecated and will be removed in future versions.

    Here is its description:

    Estimate based on keyvalue's serialization format in the RPC layer. Note that there is an extra SIZEOF_INT added to the size here that indicates the actual length of the cell for cases where cell's are serialized in a contiguous format (For eg in RPCs).
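
As a rough client-side counterpart to that estimate, the sketch below approximates the size the server would compute for one cell, so an oversized value can be caught before the put is sent. It assumes the standard KeyValue layout (length prefixes, row, family, qualifier, timestamp, type, value) and a cell without tags; the function names are hypothetical, and the result is an approximation rather than the exact server-side number.

```python
# Approximate the size HBase computes for a single cell (no tags).
# The constants follow the KeyValue layout; treat them as assumptions.

KEYVALUE_INFRASTRUCTURE_SIZE = 8   # 4-byte key length + 4-byte value length
KEY_INFRASTRUCTURE_SIZE = 12       # 2 (row length) + 1 (family length) + 8 (timestamp) + 1 (type)
SIZEOF_INT = 4                     # extra length prefix added for the RPC format
DEFAULT_MAX_CELL_SIZE = 10485760   # default of hbase.server.keyvalue.maxsize


def estimate_cell_size(row: bytes, column: bytes, value: bytes) -> int:
    """Estimate the serialized size of one cell.

    `column` uses the happybase b"family:qualifier" form.
    """
    family, _, qualifier = column.partition(b":")
    key_size = KEY_INFRASTRUCTURE_SIZE + len(row) + len(family) + len(qualifier)
    return KEYVALUE_INFRASTRUCTURE_SIZE + key_size + len(value) + SIZEOF_INT


def exceeds_limit(row: bytes, column: bytes, value: bytes,
                  limit: int = DEFAULT_MAX_CELL_SIZE) -> bool:
    """Mirror the server check: a limit of 0 or less disables it."""
    return limit > 0 and estimate_cell_size(row, column, value) > limit
```

With happybase, a check like `exceeds_limit(row_key, b"cf:pdf", pdf_bytes)` could run just before `table.put(...)`, where `row_key`, `cf:pdf` and `pdf_bytes` are placeholders for your own row key, column and file content.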

    You have to check where the value is set. First check the "ordinary" values in HRegion.java:

    this.maxCellSize = conf.getLong(HBASE_MAX_CELL_SIZE_KEY, DEFAULT_MAX_CELL_SIZE);

    So there is probably a HBASE_MAX_CELL_SIZE_KEY and DEFAULT_MAX_CELL_SIZE limit somewhere:

    public static final String HBASE_MAX_CELL_SIZE_KEY = "hbase.server.keyvalue.maxsize";
    public static final int DEFAULT_MAX_CELL_SIZE = 10485760;
    

    This is the 10485760-byte (10 MB) limit that appears in your error message. If you need to, you can try raising hbase.server.keyvalue.maxsize to a value large enough for your files. I recommend testing it properly before going live with it (the limit probably exists for a reason).
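
To see how far the limit would have to be raised, the two numbers can be pulled out of the error text. This is a minimal sketch that assumes the message keeps the "Cell with size ... exceeds limit of ... bytes" wording produced by the code above; `parse_cell_size_error` is a hypothetical helper, not part of happybase.

```python
import re

# Matches the message built in checkCellSizeLimit.
_CELL_SIZE_ERROR = re.compile(r"Cell with size (\d+) exceeds limit of (\d+) bytes")


def parse_cell_size_error(message):
    """Return (cell_size, limit) in bytes from the error text, or None."""
    if isinstance(message, bytes):
        # happybase surfaces the Thrift IOError message as bytes
        message = message.decode("utf-8", errors="replace")
    match = _CELL_SIZE_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


error = (b"org.apache.hadoop.hbase.DoNotRetryIOException: "
         b"Cell with size 80941994 exceeds limit of 10485760 bytes")
size, limit = parse_cell_size_error(error)
print(f"cell is {size / 2**20:.1f} MiB, limit is {limit / 2**20:.1f} MiB")
# prints: cell is 77.2 MiB, limit is 10.0 MiB
```

For the error in the question, the cell is roughly 77 MiB, so the limit would need to be well above the 10 MiB default.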

    Edit: Adding information about how to change the value of hbase.server.keyvalue.maxsize. Check the configuration files, where you can read:

    hbase.client.keyvalue.maxsize (Description)

    Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split it helps avoiding that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.

    Default: 10485760
    

    hbase.server.keyvalue.maxsize (Description)

    Maximum allowed size of an individual cell, inclusive of value and all key components. A value of 0 or less disables the check. The default value is 10MB. This is a safety setting to protect the server from OOM situations.

    Default: 10485760
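
To confirm which value a given hbase-site.xml actually carries, the file can be read with the Python standard library. The sketch below assumes the usual Hadoop-style `<configuration><property><name/><value/></property></configuration>` layout; the helper names and the example XML are hypothetical, and the location of the file on your cluster is not something this answer can confirm.

```python
import xml.etree.ElementTree as ET

DEFAULT_MAX_CELL_SIZE = 10485760  # default of hbase.server.keyvalue.maxsize


def read_hbase_property(site_xml: str, name: str, default=None):
    """Return the value of property `name` from hbase-site.xml text."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return default


def max_cell_size(site_xml: str) -> int:
    """Server-side cell size limit, falling back to the documented default."""
    return int(read_hbase_property(
        site_xml, "hbase.server.keyvalue.maxsize", DEFAULT_MAX_CELL_SIZE))


# Hypothetical example content: the server limit raised to 100 MB.
EXAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>hbase.server.keyvalue.maxsize</name>
    <value>104857600</value>
  </property>
</configuration>
"""

print(max_cell_size(EXAMPLE_SITE_XML))
```

If the property is absent from the file, the function returns the 10485760 default, which matches the limit reported in the error. Note that the check in the stack trace runs on the region server, so it is the server-side property that has to change, and a new value typically only takes effect once the region servers have picked up the updated configuration.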