
Should I run the Glue crawler every time to fetch the latest data?


I have an S3 bucket named Employee. Every three hours I get a file in the bucket with a timestamp attached to it. I will be using a Glue job to move the file from S3 to Redshift with some transformations. The input files in the S3 bucket have a fixed structure. My Glue job uses the table created in the Data Catalog via a crawler as its input.

First run:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "employee_623215", transformation_ctx = "datasource0")

After three hours, when I get another employee file, should I crawl it again?

Is there a way to have a single table in the Data Catalog, such as employee, and update it with the latest S3 file so that it can be used by the Glue job for processing? Or should I run the crawler every time to get the latest data? The problem with that is that more and more tables get created in my Data Catalog.

Please let me know if this is possible.


Solution

  • An alternative approach is to read directly from S3 instead of from the catalog and process the data in the Glue job.

    This way you do not need to run the crawler again.

    Use

    from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")

    documented in the AWS Glue DynamicFrameReader API reference.
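
    A minimal sketch of that approach, assuming the timestamped files land under s3://my-bucket/employee/ as CSV with a header row (the bucket path, input format, and job-setup boilerplate are assumptions, not from the question):

    import sys

    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job setup
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())

    # Read straight from S3: no Data Catalog table or crawler involved.
    # Pointing "paths" at the prefix means each run sees every file there,
    # including newly arrived timestamped ones.
    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/employee/"]},  # assumed path
        format="csv",                         # assumed input format
        format_options={"withHeader": True},  # assumed: files carry a header row
        transformation_ctx="datasource0",
    )

    If job bookmarks are enabled on the job, the transformation_ctx also lets Glue skip S3 files it has already processed, so each run picks up only the new three-hourly file.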