Search code examples
arraysamazon-athenaamazon-quicksight

Cannot access Array elements in AWS Athena


I've got a .txt file, with column with an array of comma-separated strings enclosed in brackets, that I would like to perform some analysis on in AWS Athena/QS. The raw data looks like this:

col_id    col2
1         ["string1", "string2", "string3", "string4"] 
2         ["string1", "string2"]
3         ["string1", "string2", "string3"]
...

I've created a table in Athena with the following:

create external table db.xx (
    col1 string,
    col2 array<string>

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '  ',
  'field.delim' = ' ',
  'collection.delim' = ','
) LOCATION 's3://xxx'
TBLPROPERTIES ("skip.header.line.count"="1");

The table has created successfully, and the column is recognised as an array data type.

However I am unable to access the elements within the array.

select element_at(col2,1) from table returns:

string1, string2, string3, string4
string1, string2
string1, string2, string3

I've also tried to remove the [] and "" from the raw data, but still get the same results.


Solution

  • CSVs don't have an array type, and there are many ways one could encode an array. Athena unfortunately doesn't automatically figure out which way your data does it, even if you say that a column is of type array<string>.

    There is, however, a workaround: use string for the column type and then cast the value to JSON at query time (since it looks like your arrays are encoded the way JSON encodes string arrays), or use one of the many JSON functions to extract values from the array:

    Create the table like this:

    create external table db.xx (
        col1 string,
        col2 string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '  ',
      'field.delim' = ' ',
      'collection.delim' = ','
    ) LOCATION 's3://xxx'
    TBLPROPERTIES ("skip.header.line.count"="1");
    

    and then query it like this:

    SELECT
      col1,
      json_array_get(col2, 0)
    FROM db.xx