Search code examples
jsonhivecreate-tablehive-serdehiveddl

Hive external table read json as textfile


I'm trying to create a hive external table for a json file in .txt format. I have tried several approaches but I think I'm going wrong in how the hive external table should be defined:

My Sample JSON is:

[[
{
    "user": "ron",
    "id": "17110",
    "addr": "Some address"
},
{
    "user": "harry",
    "id": "42230",
    "addr": "some other address"


}]]

As you can see it's array inside an array. It seems that this is valid json, returned by an API, although I have read posts saying that json should start with a '{'

Anyway, I am trying to create an external table like this:

    CREATE EXTERNAL TABLE db1.user(
    array<array<
    user:string,
    id:string,
    desc:string
    >>)
  PARTITIONED  BY(date string)
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/tmp/data/addr'

This does not work. Nor does something like this work

CREATE EXTERNAL TABLE db1.user(
    user string,
    id string,
    desc string

 )PARTITIONED  BY(date string)
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/tmp/data/addr'

After trying to modify the json text file, replacing [ with { etc., adding parition I still wasn't able to query it using select *. I'm missing a key piece in the table structure.

Can you please help me so that the table can read my JSON correctly?

If required, I can modify the input JSON, if the double [[ is a problem.


Solution

  • 1st: Row in a table should be represented in a file as single line, no multi-line JSON.

    2nd: You can have array<some complex type> as a single column, but this is not convenient because you will need to explode the array to be able to access nested elements. The only reason you may want such structure is when there are really multiple rows with array<array<>>.

    3rd: Everything in [] is an array. Everything in {} is a struct or map, in your case it is a struct, and you have missed this rule. Fields user, id and desc are inside struct, and struct is nested inside array. Array can have only type in it's definition, if it is nested struct, then it will be array<struct<...>>, If array is of simple type then, for example array<string>.

    4th: Your JSON is not valid one because it contains extra comma after address value, fix it.

    If you prefer to have single column colname containing array<array<struct<...>>> then create table like this:

    CREATE EXTERNAL TABLE db1.user(
        colname array<array<
        struct<user:string,
        id:string,
        desc:string>
        >>)...
    

    And JSON file should look like this (single line for each row):

    [[{"user": "ron","id": "17110","addr": "Some address"}, {"user": "harry","id": "42230","addr": "some other address"}]]
    

    If the file contains single big array nested in another array, better remove [[ and ]], remove commas between structs and extra newlines inside structs. If single row is a struct {}, you can define your table without upper struct<>, only nested structs should be defined as struct<>:

    CREATE EXTERNAL TABLE db1.user(
        user string,
        id   string,
        desc string
       )...
    

    Note, that in this case you do not need : between column name and type. Use : only inside nested structs. And JSON should look like this (whole JSON object as defined in DDL in a single line, no comma between structs, each struct in a separate line):

    {"user": "ron","id": "17110","addr": "Some address"}
    {"user": "harry","id": "42230","addr": "some other address"}
    

    Hope you got how it works. Read more in the JSONSerDe manual.