I'm looking for any best practices for loading data from a source that is itself slow - whether from high load on the server, low bandwidth or something else - and where you can't do anything about performance on the source side, only on the loading side.
I have a data source, Pervasive PSQL Server 11.2, and a table with 30M+ records. I want to extract only the last two years of data. Any query using a WHERE clause runs for 60+ minutes without returning any data. Doing a
SELECT * FROM [table];
works, but gets me 10+ years of data I don't need and takes so long that I'm not able to run it every night. Setting AutoAdjustBufferSize to true improves performance, but not nearly enough.
The date column has a DATE datatype. I've tried YEAR(column) > 2018, tried BETWEEN, and tried converting the first four characters to INT, but nothing has sped up the load time.
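For reference, the filters I tried looked roughly like this (the column name is a placeholder for my real one, and the date literals are just examples):

-- order_date stands in for my actual DATE column
SELECT * FROM [table] WHERE YEAR(order_date) > 2018;
SELECT * FROM [table] WHERE order_date BETWEEN '2019-01-01' AND '2021-01-01';
-- also tried casting the first four characters of the date to INT and comparing to 2018

Each of these still ran for 60+ minutes without returning any rows.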
We ended up with a two-stage solution. We load previous years into history tables (a one-time load, or maybe yearly), then load the current year weekly, and load the previous month + current month in a daily load. Limiting the data that much made it possible to use WHERE clauses, and I think it's a good enough data-load pattern to use; a rough sketch of the extraction queries follows.
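These are placeholder names and hard-coded dates purely for illustration; in practice the date boundaries are computed by the load job rather than written as literals:

-- One-time (or yearly) load of prior years into the history tables
SELECT * FROM [table] WHERE order_date < '2020-01-01';

-- Weekly load: current year only
SELECT * FROM [table] WHERE order_date >= '2020-01-01';

-- Daily load: previous month + current month
SELECT * FROM [table] WHERE order_date >= '2020-05-01';

With the date windows cut down this far, the queries return data in an acceptable time for a nightly schedule.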