amazon-web-services amazon-s3 amazon-redshift aws-glue aws-glue-data-catalog

Create tables in Glue Data Catalog for data in S3 and unknown schema


My use case: in an ETL-based service (note: the service does not use Glue ETL; it is an independent service), I extract data from AWS Redshift clusters into S3. The data in S3 is then fed into the T and L jobs. I want to populate the metadata into the Glue Data Catalog. The most basic solution is a Glue Crawler, but the crawler runs for approximately 1 hour and 20 minutes (there are a lot of S3 partitions). The other solution I came across is to use the Glue APIs. However, I am having trouble defining the data types that way.

Is there any way I can create/update the Glue Catalog tables when the data is in S3 and the data types are known only during the extraction process?

Also, when the T and L jobs run, the data types should be readily available in the catalog.


Solution

  • I found a solution to the problem: I ended up using the Glue Catalog APIs to make it seamless and fast. I created an interface that interacts with the Glue Catalog and overrode its methods for the various data sources. Right after the data has been loaded into S3, I run a query to get the schema from the source, and the interface then writes that schema to the catalog.
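
The answer does not include its code, so the following is only a minimal sketch of what such an interface might look like, not the author's actual implementation. It assumes boto3's Glue client, Parquet files in S3, and a hypothetical `run_query` callable that returns `(column_name, data_type)` rows for a source table. The class names, the type mapping, and the fallback to `string` are all illustrative assumptions.

```python
from abc import ABC, abstractmethod

# Assumed mapping from Redshift type names to Glue (Hive) type names.
# Extend it for the types your extracts actually contain.
REDSHIFT_TO_GLUE = {
    "smallint": "smallint",
    "integer": "int",
    "bigint": "bigint",
    "real": "float",
    "double precision": "double",
    "boolean": "boolean",
    "date": "date",
    "timestamp without time zone": "timestamp",
    "character varying": "string",
    "character": "string",
}


class CatalogSource(ABC):
    """Hypothetical interface: one subclass per data source."""

    @abstractmethod
    def fetch_schema(self, table):
        """Return [(column_name, source_type), ...] for the table."""

    @abstractmethod
    def to_glue_type(self, source_type):
        """Translate a source type name into a Glue type name."""

    def build_table_input(self, table, s3_location, partition_keys=()):
        """Build the TableInput dict expected by create_table/update_table."""
        cols = [{"Name": name, "Type": self.to_glue_type(src_type)}
                for name, src_type in self.fetch_schema(table)]
        return {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet"},  # assumes Parquet
            # Partition columns are listed separately from data columns.
            "PartitionKeys": [c for c in cols if c["Name"] in partition_keys],
            "StorageDescriptor": {
                "Columns": [c for c in cols if c["Name"] not in partition_keys],
                "Location": s3_location,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary":
                        "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                },
            },
        }


class RedshiftSource(CatalogSource):
    def __init__(self, run_query):
        # run_query is a hypothetical callable: table name -> rows of
        # (column_name, data_type), e.g. backed by information_schema.columns.
        self._run_query = run_query

    def fetch_schema(self, table):
        return list(self._run_query(table))

    def to_glue_type(self, source_type):
        base, _, args = source_type.strip().lower().partition("(")
        base = base.strip()
        if base in ("numeric", "decimal"):
            # Keep precision and scale; Redshift defaults to (18,0).
            return f"decimal({args}" if args else "decimal(18,0)"
        # Length arguments such as varchar(256) are dropped; unknown
        # types fall back to string (an assumption, adjust as needed).
        return REDSHIFT_TO_GLUE.get(base, "string")


def upsert_table(glue, database, table_input):
    """Create the table, or update it if it already exists."""
    try:
        glue.create_table(DatabaseName=database, TableInput=table_input)
    except glue.exceptions.AlreadyExistsException:
        glue.update_table(DatabaseName=database, TableInput=table_input)
```

With a real client this would be called right after the extract lands in S3, for example `upsert_table(boto3.client("glue"), "my_db", source.build_table_input("orders", "s3://my-bucket/orders/", partition_keys=("dt",)))`, where the database, bucket, and table names are placeholders. New S3 partitions would still need to be registered separately, for instance with the Glue `batch_create_partition` call.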