python pandas aws-lambda snowflake-cloud-data-platform

snowflake-connector-python[pandas] write_pandas creates duplicate records in table


I am attempting to copy data into Snowflake from an AWS Lambda function. I have a dataframe that contains no duplicates. I verify this by checking the dataframe like so:

df.duplicated().any() and verify that it returns False

I then double check by filtering by what should be a unique value in the dataframe

df[df["myColumn"] == "uniqueValue"] and I get 1 result.
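Note that df.duplicated() with no arguments only flags rows that match on every column. Here is a minimal sketch of checking both whole-row duplicates and duplicates on the column that should be unique. The sample frame is made up, and "myColumn" is just the column name from the question.

```python
import pandas as pd

# Hypothetical sample data; "myColumn" stands in for the column that should be unique.
df = pd.DataFrame(
    {"myColumn": ["a", "b", "b"], "value": [1, 2, 3]}
)

# True only if some row repeats another row in every column.
has_full_row_dupes = bool(df.duplicated().any())

# True if the key column repeats, even when the other columns differ.
has_key_dupes = bool(df.duplicated(subset=["myColumn"]).any())

# Every row that shares a key with another row (keep=False marks all copies).
repeated_keys = df[df.duplicated(subset=["myColumn"], keep=False)]
```

If both checks come back False, the dataframe itself is unlikely to be the source of the repeated rows.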

I then run the following:

from snowflake.connector.pandas_tools import write_pandas

write_pandas(
    conn=con,
    df=df,
    table_name=table_name,
    database=database,
    schema=schema,
    chunk_size=chunk_size,
    quote_identifiers=False,
)

When the data lands in the Snowflake table and I query it, there are 5 copies of each row.

I verified that this function only runs one time as well.

Why am I getting 5 duplicates?

EDIT: I realized the problem is not related to this package. After 1 minute the Lambda is triggered again, then again 1 minute later, and so on, until it has been triggered 5 times.

I have no idea why it is being triggered multiple times. All of the executions eventually succeed, but 5 of them are running before the first one completes.

UPDATE

Verified that it's not a memory issue and not a timeout issue.

What I have noticed is that the next Lambda execution seems to be triggered while the function is making an API call to retrieve some external data. I am not sure why that would play a role, but it seems to be related.

Also, the count is not fixed at 5: the Lambda is re-triggered every minute until the first execution finishes. I can see that the logs stop when the API call starts, and it is at that same log mark that I see the next Lambda execution start.
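One way to tell the overlapping executions apart in the logs is to record the request ID that Lambda passes in the context object. Below is a minimal sketch; the helper name describe_invocation is made up for illustration. My understanding is that a caller sending the request again produces a new request ID each time, while Lambda's own retry of an asynchronous event reuses the original one. Treat that as an assumption to check against your own logs.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def describe_invocation(event, context):
    """Log and return identifying details for this invocation."""
    details = {
        # Identifier of this invocation request, provided by the Lambda runtime.
        "request_id": context.aws_request_id,
        # Milliseconds left before the function's configured timeout.
        "remaining_ms": context.get_remaining_time_in_millis(),
    }
    logger.info("invocation started: %s", details)
    return details


def lambda_handler(event, context):
    details = describe_invocation(event, context)
    # ... the existing work (external API call, write_pandas) would go here ...
    return details
```

Five different request IDs roughly one minute apart would point at whatever is calling the function, not at the function itself.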


Solution

  • I'm not sure whether this is a Jenkins-specific issue, but I found that I was invoking the function synchronously, and if the Lambda had not responded after 1 minute, it was triggered again. Running with the invoke-async CLI option instead of invoke led to the duplication stopping.
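
The one-minute interval is consistent with the caller giving up on the response and sending the request again: the AWS CLI and boto3 both default to a 60-second read timeout and retry after it. Whether that is exactly what happened in the Jenkins job is an assumption. Here is a sketch of the two options in boto3: invoke asynchronously with InvocationType="Event", or stay synchronous with a longer read timeout and no retries. The helper names and the 900-second default are illustrative choices.

```python
import json


def make_lambda_client(read_timeout_s=900):
    """Build a Lambda client that waits longer and never re-sends the request."""
    # Imported here so the rest of the sketch can be exercised without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(
        read_timeout=read_timeout_s,   # the default is 60 seconds
        retries={"max_attempts": 0},   # do not retry after a timeout
    )
    return boto3.client("lambda", config=config)


def invoke_fire_and_forget(client, function_name, payload):
    """Queue the function asynchronously; returns the HTTP status (202 on success)."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # return immediately, do not wait for the result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return response["StatusCode"]
```

Usage would look like invoke_fire_and_forget(make_lambda_client(), "my-function", {"key": "value"}), where "my-function" is a placeholder name. With the AWS CLI, the equivalents should be aws lambda invoke --invocation-type Event, or passing a larger --cli-read-timeout to a synchronous invoke. With an Event invocation the caller does not receive the function's return value.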