Search code examples
amazon-web-servicesamazon-athenaaws-sdk-jsaws-glue-data-catalog

Create Glue data catalog via Athena SDK


I would like to use Athena to run queries on data in an S3 bucket in another AWS account. I am using Javascript SDK. Reading through the documentation, I understand that I must first create a data catalog that will point Athena to the correct S3 location.

I think that I have to call the createDataCatalog method. Most of the arguments for this method are self-explanatory, except for the "parameters" argument, which seems to contain information about how the data catalog will be created. But I am unable to find anywhere how these parameters should look.

So my questions are:

  1. What are the parameters to provide to here?
  2. Is this the right way to create a glue data catalog (including database and table)?
  3. Once done, will this allow me to run Athena queries on the data catalog?

Solution

  • For a simple use case with static S3 data,

    • We first need to create Glue Table using Glue createTable API pointing to S3 location. Few Examples in cli documentation.
    • Run queries against this Glue Table from Athena

    Here is an example to create Glue Database and Table

    const AWS = require("aws-sdk");
    AWS.config.update({ region: "us-east-1" });
    
    const glue = new AWS.Glue();
    const dbName = "test-db";
    glue.createDatabase(
      {
        DatabaseInput: {
          Name: dbName,
        },
      },
      function (dbCrtErr, dbRsp) {
        if (dbCrtErr.message === "Database already exists." || dbRsp) {
          console.log("dbRsp", dbRsp);
          glue.createTable(
            {
              DatabaseName: dbName,
              TableInput: {
                Name: "my-table",
                Parameters: {
                  classification: "json",
                  compressionType: "none",
                },
                TableType: "EXTERNAL_TABLE",
                StorageDescriptor: {
                  Location: "s3://my-s3-bucket-with-events/",
                  InputFormat:
                    "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                  OutputFormat:
                    "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                  Columns: [
                    {
                      Name: "id",
                      Type: "string",
                    },
                    {
                      Name: "name",
                      Type: "string",
                    },
                  ],
                },
              },
            },
            function (error, response) {
              console.log("error", error, "response", response);
            }
          );
        } else {
          console.log("dbCrtErr", dbCrtErr);
        }
      }
    );