AWS Glue Data Catalog is composed of different structures, e.g.
Database,
Table,
Partition,
Column etc. Haven't looked through every one of them, but it seems that Parameters
fields (a map array of key-value pairs) is present in all of them. I have noticed that if table was created by a crawler then we can see something like:
{
"CrawlerSchemaDeserializerVersion": "1.0",
"CrawlerSchemaSerializerVersion": "1.0",
"UPDATED_BY_CRAWLER": "some-crawler-name",
"averageRecordSize": "12",
"classification": "parquet",
"compressionType": "none",
"objectCount": "123",
"recordCount": "1234567",
"sizeKey": "1234567890",
"typeOfData": "file"
}
for Table["Parameters"]
as well as Table["StorageDescriptor"]["Parameters"]
. If our table has partitions, then each one of them would have the same dictionary, but with different values for averageRecordSize, objectCount, recordCount, sizeKey. After summing them up we end up with the same values as in Table["Parameters"]
. All of it makes sense, and I guess these values determine crawlers' logic when we want to re-run it on demand or on schedule.
Instead of using crawlers, I manage multiple AWS Glue catalogs manually with boto3 and airflow. For example, I could copy partitions definition from db_1.table_1
in catalog 12345 into db_2.table_2
in catalog 6789, or define additional meta-parameters in table_1
. However, this Parameters field is still quite a mystery to me and I couldn't find any sort of documentation related to it.
It looks like some keys, e.g. recordCount
, are reserved for internal use within AWS Glue (although they can be defined manually).
Docs mention that these key-value pairs define properties associated with the table and some restrictions:
- Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.
- Each value is a UTF-8 string, not more than 512000 bytes long.
Is there any restriction on how many keys can Parameters
field contain? Does number of these key-value pairs affect performance when you query data?
Parameters
field in sync for table, partition and their storage descriptorsUnfortunately, there is no complete document explaining all TBLPROPERTIES.
SHOW TBLPROPERTIES <table_name>
I don't think it impacts performance negatively. Internal properties are usually used for certain activities e.g. crawled_by
is used by Glue , numRows
is used by Athena/Redshift, has_encrypted_data
is used to check if s3 data is encrypted or not and so on.
skip.header.line.count=2
can skip first two rows and will not treat as data rows.