
Import complex data structure in Hive with custom separators


I have a huge dataset with the following structure:

fieldA,fieldB,fieldC;fieldD|fieldE,FieldF;fieldG|fieldH,FieldI ...

where:

fieldA, fieldB, and fieldC are strings that should be imported into separate columns

fieldD|fieldE,FieldF;fieldG|fieldH,FieldI is an array (elements separated by semicolons) of maps (key separated from value by |), where each map value is in turn an array (elements separated by commas, e.g. fieldE,FieldF)

My problem is that the initial array is separated from fieldA,fieldB,fieldC by a semicolon as well. My question is: how do I set the separators correctly when I create the table?

This one does not recognize the array, although I provide a semicolon as the field separator:

CREATE TABLE string_array(
    first_part STRING            -- this would store fieldA,fieldB,fieldC
   ,second_part ARRAY<STRING>    -- this would store fieldD|fieldE,FieldF;fieldG|fieldH,FieldI, split by semicolon
       )
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u003b'
    COLLECTION ITEMS TERMINATED BY '\u003b'
    MAP KEYS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '...' INTO TABLE string_array;

Any ideas how to make it work so I can build upon it? Thanks a lot in advance!


Solution

  • Great question.

    I think that we can break this problem up into two discrete pieces: (1) Hive table structure, and (2) data delimiters.

    Let's start by looking at the Hive table structure. If I understood your data structure correctly (please correct me if I didn't), the table structure that would best describe your data could be represented as:

    CREATE TABLE string_array
    AS
    SELECT 'fieldA,fieldB,fieldC' AS first_part, array(map('fieldD', array('fieldE', 'FieldF')), map('fieldG', array('fieldH','FieldI'))) AS second_part;
    

    Note that the field second_part is an array of maps, where the key to each map references an array of strings. In other words, the field second_part consists of an array within a map within an array.
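
    Purely as an illustration (this is not part of the Hive workflow), the same nesting can be written out as a Python literal:

    ```python
    # The value of second_part, spelled out as nested Python data:
    # an array (list) of maps (dicts) whose values are arrays of strings.
    second_part = [
        {"fieldD": ["fieldE", "FieldF"]},
        {"fieldG": ["fieldH", "FieldI"]},
    ]
    ```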

    If I use the statement above to create a table, I can then copy the resulting table to the local filesystem and look at how Hive assigns default delimiters to it. I know that you don't want to use default delimiters, but please bear with me here. The resulting table looks like this in its serialized on-disk representation:

    00000000  66 69 65 6c 64 41 2c 66  69 65 6c 64 42 2c 66 69  |fieldA,fieldB,fi|
    00000010  65 6c 64 43 01 66 69 65  6c 64 44 04 66 69 65 6c  |eldC.fieldD.fiel|
    00000020  64 45 05 46 69 65 6c 64  46 02 66 69 65 6c 64 47  |dE.FieldF.fieldG|
    00000030  04 66 69 65 6c 64 48 05  46 69 65 6c 64 49 0a     |.fieldH.FieldI.|
    

    If we look at how Hive assigns the delimiters, we note that it actually uses five types or levels of delimiters:

    delimiter 1 = x'01' (between fieldC & fieldD) -- between first_part and second_part
    delimiter 2 = x'02' (between fieldF & fieldG) -- between the two maps in the array of maps
    delimiter 3 = x'03' not used
    delimiter 4 = x'04' (between fieldD & fieldE) -- between the key and the array of fields within the map
    delimiter 5 = x'05' (between fieldE & fieldF) -- between the fields within the array within the map
    

    And herein lies your problem. Current versions of Hive (as of 0.11.0) only allow you to override three levels of delimiters, but due to the levels of nesting within your data, Hive requires more than three.

    My suggestion would be to pre-process your data to use Hive's default delimiters. With this approach you should be able to load your data into Hive and reference it.
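
    For example, the conversion could be sketched in Python as follows. This is a minimal sketch, assuming one record per line and that the raw field values never contain the control characters \x01, \x02, \x04, or \x05; the function name is my own:

    ```python
    def to_hive_defaults(line):
        """Rewrite one custom-delimited record into Hive's default delimiters.

        Input : fieldA,fieldB,fieldC;fieldD|fieldE,FieldF;fieldG|fieldH,FieldI
        Output: the same record using \x01, \x02, \x04, and \x05, matching the
                hex dump above (\x03 stays unused, as each map has one key).
        """
        parts = line.rstrip("\n").split(";")
        first_part, elements = parts[0], parts[1:]
        converted = []
        for element in elements:
            key, _, values = element.partition("|")
            # \x04 separates the map key from its value;
            # \x05 separates the items of the inner array.
            converted.append(key + "\x04" + values.replace(",", "\x05"))
        # \x01 separates the two top-level columns;
        # \x02 separates the maps within the array.
        return first_part + "\x01" + "\x02".join(converted)
    ```

    Once the file has been rewritten this way, you should be able to declare second_part as ARRAY<MAP<STRING, ARRAY<STRING>>> and omit the ROW FORMAT clause entirely, since the data now matches Hive's defaults.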