Search code examples
amazon-web-servicesamazon-redshiftjsonpathamazon-kinesis

Kinesis to Redshift COPY Command using JSONPaths File Error


I am trying to send records of API calls to S3 then only specific fields to redshift. For example, my program sends the following JSON to S3 successfully:

{

"user": "user_name",
"type": "GET",
"date": "2016-03-22 16:14:13",
"data": [{
    "EmpID": 1,
    "LastName": "Smith",
    "FirstName": "Bob",
    "DOB": "1910-01-01",
    "SSN": "123456789",
    "HireDate": "1910-01-01",
    "City": "Town",
    "Address": "123 Abc"
}]

}

The "data" field may contain one record like the example above or many records if the query returns multiple results. I want to copy only the information from data into the "Employee" table on redshift with the same column names. Copy with json 'auto' does not work because it parses only top level fields. In an attempt to solve this, I created the following JSON paths file:

{

"jsonpaths": [
    "$.data[0].EmpID",
    "$.data[0].LastName",
    "$.data[0].FirstName",
    "$.data[0].DOB",
    "$.data[0].SSN",
    "$.data[0].HireDate",
    "$.data[0].City",
    "$.data[0].Address",
]

}

Here is an example run of my program:

  1. Run my AWS Java web project on Tomcat server
  2. Enter http://localhost:000/projectname/rest/employee/1 (which successfully returns a json object for employee id 1 from my MS SQL server database)
  3. Creates the json object at the top of this post with metadata about the call and the result of the query
  4. Writes this object successfully to 's3://bucket/prefix/'
  5. Tries to run the copy command and fails. When I go to the load tab of my redshift cluster for more information about why the load failed the status of the command is terminated and the SQL field says "COPY ANALYZE employee"
  6. When I try to run the manual version of the copy command below, I get the following error: "ERROR: XX000: Manifest file is not in correct json format"

    COPY employee
    FROM 's3://bucket/prefix/filename'
    CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
    JSON 's3://bucket/jsonpaths.json';
    

In an effort to address the manifest problem, I created the following file:

{

  "entries": [
    {"url":"s3://bucket/prefix/file", "mandatory":true}
  ]

}

then ran the copy command:

    COPY employee
    FROM 's3://bucket/manifest.json'
    CREDENTIALS'aws_access_key_id=<>;aws_secret_access_key=<>' 
    MANIFEST

This yields the stl_load_error: "Delimiter not found" with the raw field value as the full json file saved to s3. If I add the jsonPaths file line from the previous copy command, I get the same manifest error as before.

Questions:

-Is my JSONPaths file in the correct format?

-Is this the best way to extract from lower levels of a JSONArray?

-Is the format of my manifest file correct? Why is a manifest file necessary?

Note: before I added the metadata, I tested the s3 to redshift connection with just the fields I wanted to load and it loaded successfully, so it is unlikely my redshift destination is improperly configured.

Thank you for your help!


Solution

  • From what I remember, when you try to load JSON data and your JSONPaths file has a syntax error, Redshift throws the (incorrect and misleading) exception that your manifest file is ill-formed. So, I am pretty sure that the following command would work, if there are no syntax errors in your JSONPaths file:

    COPY employee
    FROM 's3://bucket/prefix/filename'
    CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
    JSON 's3://bucket/jsonpaths.json';
    

    Now, what's wrong with your JSONPaths file? To me, it looks fine, but I can suggest couple of things you should try:

    • Remove the comma in the last entry, i.e., change "$.data[0].Address", to "$.data[0].Address".
    • If the above doesn't work out, try without using dot notation. For e.g., change "$.data[0].EmpID" to "$['data'][0]['EmpID']".

    P.S.- This is completely wrong:

    COPY employee
    FROM 's3://bucket/manifest.json'
    CREDENTIALS'aws_access_key_id=<>;aws_secret_access_key=<>' 
    MANIFEST
    

    That is because you are telling Redshift to use a manifest file, but not that the input data is in JSON format or what JSONPaths to use. So, by default,

    the COPY command expects the source data to be in character-delimited UTF-8 text files. The default delimiter is a pipe character ( | ).