Search code examples
pentahosnowflake-cloud-data-platformpentaho-data-integrationpdisnowsql

Bulk load into Snowflake with Petnatho Data Integration over JDBC is slow


We have several on premise databases and then so far had also our data warehouse as on premise. Now moving over to the cloud and data warehouse will be in Snowflake. But we still have more on premise source systems than in the cloud, so would like to stick with our on premise ETL solution. We are using Pentaho Data Integration (PDI) as our ETL tool.

The issue we have then is then that the PDI Table output step that is using the Snowflake JDBC driver is horribly slow for bulk loads into Snowflake. A year ago it was even worse, as it then just did INSERT INTO and COMMIT after every row. By today it has improved a lot, (when looking at the Snowflake history/logs) it now seems to do some kind of PUT to some temp Snowflake stage, but then from there still does some kind of INSERT to the target table and this is slow (in our test case then it took an hour to load 1 000 000 records in).

We have used the workaround for the bulk load into that we use SnowSQL (Snowflakes command line tool) scrips to make the bulk load into Snowflake that is orchestrated by PDI then. In our example case it takes then less than a minute to get the same 1 000 000 records into Snowflake.

All stuff that is then done inside the Snowflake database is just done via PDI SQL steps sent to Snowflake over JDBC and all our source system queries run fine with PDI. So the issue is only with the bulk load into Snowflake where we need to do some weird workaround:

Instead of:

PDI.Table input(get source data) >> PDI.Table output(write to Snowflake table)

we have then:

PDI.Table input(get source data) >> PDI.Write to local file >> Snowsql.PUT local file to Snowflake Stage >> Snowsql.COPY data from Snowflake Stage to Snowflake table >> PDI clear local file, also then clear Snowflake stage.

It works, but is much more complex than it needs to be (compared to previous on premise database load for example).

I don't even know if this issue is rather on the Snowflake (if the JDBC driver works not optimal) side or on the PDI side (if it just does not utilize the JDBC driver correctly), but would like to have it working better.


Solution

  • To bulk load in Snowflake, you need to do the put and copy.