I'm working on an IoT project.
I save the data from the IoT devices into S3. There are 7 kinds of data, so I saved them into 7 sub-folders of the bucket.
I set up my crawler with the following settings (a rough API equivalent is sketched after the list):
- Crawl new sub-folders only
- Create a single schema for each S3 path
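For reference, this is roughly how those settings look when the crawler is created through the API with boto3. The crawler name, role, database, and bucket path below are placeholders for my real ones:

```python
import json
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="iot-data-crawler",        # placeholder crawler name
    Role="AWSGlueServiceRole-iot",  # placeholder IAM role
    DatabaseName="iot_db",          # placeholder Glue database
    Targets={"S3Targets": [{"Path": "s3://my-iot-bucket/data/"}]},
    # "Crawl new sub-folders only"
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    # "Create a single schema for each S3 path"
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)
```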
When the crawler's first run was done, I changed the type of every column in the schema, including the partition columns, to string.
That worked well.
But at some point new columns will be added to the incoming data. Could you tell me how I should change the crawler's settings so that it produces a schema that includes all the columns?
According to the AWS documentation at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html, you have three options:
- Update the table definition in the Data Catalog – Add new columns, remove missing columns, and modify the definitions of existing columns in the AWS Glue Data Catalog. Remove any metadata that is not set by the crawler. This is the default setting.
- Add new columns only – For tables that map to an Amazon S3 data store, add new columns as they are discovered, but don't remove or change the type of existing columns in the Data Catalog. Choose this option when the current columns in the Data Catalog are correct and you don't want the crawler to remove or change the type of the existing columns. If a fundamental Amazon S3 table attribute changes, such as classification, compression type, or CSV delimiter, mark the table as deprecated. Maintain input format and output format as they exist in the Data Catalog. Update SerDe parameters only if the parameter is one that is set by the crawler. For all other data stores, modify existing column definitions.
- Ignore the change and don't update the table in the Data Catalog – Only new tables and partitions are created.
I recently had the same issue: my crawler was set to "Ignore the change and don't update the table in the Data Catalog", but I needed to add a few columns. So I changed the crawler configuration to "Add new columns only" and ran the crawler again, which added the new columns to my schema. Refer to the document linked above; the setting applies to all data stores included in the crawler.
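If you make the change through the API rather than the console, the update looks roughly like this with boto3. The crawler name is a placeholder, and per the linked page the console's "Add new columns only" option corresponds to an UpdateBehavior of LOG plus MergeNewColumns in the Configuration JSON, so verify the exact values against the current documentation:

```python
import json
import boto3

glue = boto3.client("glue")

# Switch the crawler to "Add new columns only": existing columns and their
# types are left alone, newly discovered columns are merged into the table.
glue.update_crawler(
    Name="iot-data-crawler",  # placeholder crawler name
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}},
    }),
)

# Re-run the crawler so the new columns show up in the Data Catalog.
glue.start_crawler(Name="iot-data-crawler")
```

After the run finishes, the new columns should appear in the table while the existing string columns keep their types.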