Search code examples
jsonhadoophiveemrmetastore

Hive Metastore column width limit


Using AWS EMR on the 5.2.1 version as data processing environment, when dealing with a huge JSON file that has a complex schema with many nested fields, Hive can't process it and errors as it reaches the current limit of 4000 characters column length.

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidObjectException(message:Invalid column type name is too long: [...]

Looking at the documentation, there's already many issues regarding this problem or similar, though all unresolved [1, 2]. In this it is recommended to alter several fields of the Metastore to a different value in order to allow a greater length for struct definitions.

  • COLUMNS_V2.TYPE_NAME
  • TABLE_PARAMS.PARAM_VALUE
  • SERDE_PARAMS.PARAM_VALUE
  • SD_PARAMS.PARAM_VALUE

As stated in the first issue, the proposed solution mentions:

[...] after setting the values, the Metastore must also be configured and restarted."

Yet it is nowhere specified what must be also configured beside the DB values.

Thus, after updating the fields the current local Metastore (mysql in this case) from string to mediumtext and restarting the Metastore process, still can not get any progress as attempting to load the JSON continues to fail with same error.

Am I missing something or has anyone found an alternative workaround to overcome this problem?


Solution

  • Looking at the MetaStore source code as the Hive issue HIVE-15249 relates, there is also an additional check that applies beside the parameter constraints set in the MetaStore tables.

    Particularly, in the HiveMetaStoreUtils.java file, the following declaration seems to be cause of the execution error:

    public static final int MAX_MS_TYPENAME_LENGTH = 2000; // 4000/2, for an unlikely unicode case
    

    Thus, changing the values for the mentioned in the MetaStore is not enough as the code constraint will throw the exception.