
AWS Glue Catalog API: Parameters field in metadata of different structures


The AWS Glue Data Catalog is composed of different structures, e.g. Database, Table, Partition, Column, etc. I haven't looked through every one of them, but it seems that a Parameters field (a map of key-value pairs) is present in all of them. I have noticed that if a table was created by a crawler, then we can see something like:

{
    "CrawlerSchemaDeserializerVersion": "1.0",
    "CrawlerSchemaSerializerVersion": "1.0",
    "UPDATED_BY_CRAWLER": "some-crawler-name",
    "averageRecordSize": "12",
    "classification": "parquet",
    "compressionType": "none",
    "objectCount": "123",
    "recordCount": "1234567",
    "sizeKey": "1234567890",
    "typeOfData": "file"
}

for Table["Parameters"] as well as Table["StorageDescriptor"]["Parameters"]. If our table has partitions, then each one of them would have the same dictionary, but with different values for averageRecordSize, objectCount, recordCount, sizeKey. After summing them up we end up with the same values as in Table["Parameters"]. All of it makes sense, and I guess these values determine crawlers' logic when we want to re-run it on demand or on schedule.

Instead of using crawlers, I manage multiple AWS Glue catalogs manually with boto3 and Airflow. For example, I might copy partition definitions from db_1.table_1 in catalog 12345 into db_2.table_2 in catalog 6789, or define additional meta-parameters on table_1. However, the Parameters field remains quite a mystery to me, and I couldn't find any documentation about it.
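
For reference, the partition-copy step from that example looks roughly like this. It is a sketch under a few assumptions: the catalog IDs, database and table names are the ones from the example, the same credentials can reach both catalogs, and the copied StorageDescriptor locations are valid for the target table:

import boto3

glue = boto3.client("glue")

SRC = {"CatalogId": "12345", "DatabaseName": "db_1", "TableName": "table_1"}
DST = {"CatalogId": "6789", "DatabaseName": "db_2", "TableName": "table_2"}

# Only these fields may appear in a PartitionInput; everything else that
# GetPartitions returns (CreationTime, DatabaseName, TableName, ...) is read-only.
PARTITION_INPUT_KEYS = {"Values", "LastAccessTime", "StorageDescriptor", "Parameters", "LastAnalyzedTime"}

paginator = glue.get_paginator("get_partitions")
# BatchCreatePartition accepts at most 100 partitions per call, so page by 100.
for page in paginator.paginate(**SRC, PaginationConfig={"PageSize": 100}):
    partition_inputs = [
        {k: v for k, v in partition.items() if k in PARTITION_INPUT_KEYS}
        for partition in page["Partitions"]
    ]
    if partition_inputs:
        glue.batch_create_partition(
            CatalogId=DST["CatalogId"],
            DatabaseName=DST["DatabaseName"],
            TableName=DST["TableName"],
            PartitionInputList=partition_inputs,
        )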

It looks like some keys, e.g. recordCount, are reserved for internal use within AWS Glue (although they can be defined manually).

  1. Do other services (especially Athena) use them as well?
  2. Where can I find a list of such keys and their meanings, so that my own keys won't interfere with them?
  3. The docs mention that these key-value pairs define properties associated with the table, along with some restrictions:

    1. Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.
    2. Each value is a UTF-8 string, not more than 512000 bytes long.

    Is there any restriction on how many keys the Parameters field can contain? Does the number of these key-value pairs affect performance when you query the data?

  4. How important is it to keep the Parameters field in sync across the table, its partitions, and their storage descriptors?

Solution

    1. Yes; for example, Redshift Spectrum uses them to optimize its query plan - https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html#r_CREATE_EXTERNAL_TABLE-parameters
    2. Unfortunately, there is no complete document explaining all TBLPROPERTIES.

    3. I don't think it impacts performance negatively. Internal properties are usually used for specific activities, e.g. crawled_by is used by Glue, numRows is used by Athena/Redshift, has_encrypted_data is used to check whether the S3 data is encrypted, and so on.

    4. It is important to keep them in sync, as these properties are used to manage the table and query it efficiently. Certain properties, e.g. skip.header.line.count=2, make query engines skip the first two rows of each file rather than treat them as data rows (see the sketch below for setting such a property).
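
To illustrate that last point, a table property such as skip.header.line.count can also be set without a crawler by calling UpdateTable through boto3. This is a minimal sketch with placeholder names; note that UpdateTable expects a TableInput containing only the writable fields returned by GetTable:

import boto3

glue = boto3.client("glue")

DATABASE, TABLE = "db_1", "table_1"  # placeholders

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# TableInput accepts only the writable subset of what GetTable returns.
WRITABLE_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}
table_input = {k: v for k, v in table.items() if k in WRITABLE_KEYS}

# Merge the new property into the existing Parameters instead of replacing them.
table_input.setdefault("Parameters", {})["skip.header.line.count"] = "2"

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)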