Search code examples
jsonescapingaws-glueamazon-athenapresto

Can't Query AWS Athena Presto Table Because of Dash Character in Column name


I have a file in S3 with the following contents:

{"foo-bar": {"name":"Mercury","distanceFromSun":0.39,"orbitalPeriod":0.24,"dayLength":58.65}}{"foo-bar": {"name":"Venus","distanceFromSun":0.72,"orbitalPeriod":0.62,"dayLength":243.02}}

The Glue table definition is:

CREATE EXTERNAL TABLE `planets_ion_2`(
  `foo-bar` struct<name:string,distancefromsun:decimal(38,18),orbitalperiod:decimal(38,18),daylength:decimal(38,18)> COMMENT '')
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe' 
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION
  's3://<some_bucket>/AthenaDataStore/Planets2'
TBLPROPERTIES (
  'transient_lastDdlTime'='1740775321')

I get null values for the "name" key when I query Athena like so:

select "foo-bar".name from planets_ion_2 

enter image description here

If I remove the dash from the top level key in the S3 json, from "foo-bar" to "foobar", it works fine: {"foobar": {"name":"Mercury","distanceFromSun":0.39,"orbitalPeriod":0.24,"dayLength":58.65}}{"foobar": {"name":"Venus","distanceFromSun":0.72,"orbitalPeriod":0.62,"dayLength":243.02}}

CREATE EXTERNAL TABLE `planets_ion_2`(
  `foobar` struct<name:string,distancefromsun:decimal(38,18),orbitalperiod:decimal(38,18),daylength:decimal(38,18)> COMMENT '')
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe' 
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION
  's3://<some_bucket>/AthenaDataStore/Planets2'
TBLPROPERTIES (
  'transient_lastDdlTime'='1740775321')

enter image description here

So I can't seem to escape the dash. I've tried a lot of things without success:

select `foo-bar`.name from planets_ion_2 
select `foo-bar.name` from planets_ion_2 

Any time I try to escape with backtick, I get the error Queries of this type are not supported

I'm not sure if this a Athena or Presto issue. But thanks in advance for any help.

enter image description here


Solution

  • You need to add the following after ROW FORMAT SERDE

    WITH SERDEPROPERTIES ("ion.foo-bar.path_extractor" = "('foo-bar')")

    With the dash in the name it can't find the path to the data. Use the property path_extractor to create an alias for it with single quotes around the key name since it has a dash in it.

    So the create table statement would look something like:

    CREATE EXTERNAL TABLE `planets_ion_2`(
      `foo-bar` struct<name:string,distancefromsun:decimal(38,18),orbitalperiod:decimal(38,18),daylength:decimal(38,18)> COMMENT '')
    ROW FORMAT SERDE 
      'com.amazon.ionhiveserde.IonHiveSerDe'
    WITH SERDEPROPERTIES ("ion.foo-bar.path_extractor" = "('foo-bar')")
    STORED AS INPUTFORMAT 
      'com.amazon.ionhiveserde.formats.IonInputFormat' 
    OUTPUTFORMAT 
      'com.amazon.ionhiveserde.formats.IonOutputFormat'
    LOCATION
      's3://<some_bucket>/AthenaDataStore/Planets2'
    TBLPROPERTIES (
      'transient_lastDdlTime'='1740775321')
    

    You should then be able to query name using

     select "foo-bar".name from from planets_ion_2
    

    Side note - Amazon Ion Hive SerDe is used to query Amazon Ion format data, but can be used to query non-Amazon Ion JSON datasets.

    See this note in the link

    Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets. Unlike other JSON SerDe libraries, the Amazon Ion SerDe does not expect each row of data to be on a single line. This feature is useful if you want to query JSON datasets that are in "pretty print" format or otherwise break up the fields in a row with newline characters.

    If you explore the other JSON SerDe libraries you would still need to account for the dash in the key name.

    As an example using OpenX JSON SerDe you would have to provide the property like

    WITH SERDEPROPERTIES ("mapping.foo-bar" = "foo-bar")

    to "remap" that path to account for the dash.

        CREATE EXTERNAL TABLE `planets_ion_2`(
          `foo-bar` struct<name:string,distancefromsun:decimal(38,18),orbitalperiod:decimal(38,18),daylength:decimal(38,18)> COMMENT '')
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    WITH SERDEPROPERTIES ("mapping.foo-bar" = "foo-bar")
        LOCATION
          's3://<some_bucket>/AthenaDataStore/Planets2'