I have a bunch of U-SQL activities that manipulate & transform data in an Azure Data Lake. Out of this, I get a csv file that contains all my events. Next I would just use a Copy Data activity to copy the csv file from the Data Lake directly into an Azure SQL Data Warehouse table.
In short, the pipeline:

1. reads the JSON files stored in the Data Lake and creates a staging .csv file;
2. compares the staging .csv file & a production .csv file, injects the latest changes (and avoids duplicates) and saves the production .csv file;
3. copies the .csv production file directly to the Warehouse table.

I realized that my table contains duplicated rows and, after having tested the U-SQL scripts, I assume that the Copy Data activity somehow merges the content of the csv file into the table.
Question
I am not convinced I am doing the right thing here. Should I define my warehouse table as an EXTERNAL table that would get its data from the .csv
production file? Or should I change my U-SQL to only include the latest changes?
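For reference, the kind of EXTERNAL table I have in mind would be defined roughly like this (a sketch with made-up names, assuming PolyBase can reach the Data Lake through a database scoped credential):

```sql
-- Sketch only: hypothetical names; assumes a master key, a database scoped
-- credential (MyAdlsCredential) and an "ext" schema already exist.
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://mydatalakestore.azuredatalakestore.net',
    CREDENTIAL = MyAdlsCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- Reads the production csv in place; no data is copied into the Warehouse.
CREATE EXTERNAL TABLE ext.Events (
    EventId   INT,
    EventName NVARCHAR(200),
    EventDate DATETIME2
)
WITH (
    LOCATION = '/production/events.csv',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = CsvFormat
);
```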
Whether you want to use external tables depends on your use case. If you want the data to be stored inside SQL DW for better performance, you have to copy it at some point, e.g. via a stored procedure. You could then just call the stored procedure from ADF, for instance.
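For example, a minimal sketch of such a procedure, assuming an external table ext.Events over the production csv and a target table dbo.Events (all names hypothetical):

```sql
-- Sketch only: reloads the internal, distributed table from the external csv
-- so queries run against data stored inside SQL DW. Names are hypothetical.
CREATE PROCEDURE dbo.LoadEvents
AS
BEGIN
    IF OBJECT_ID('dbo.Events_New') IS NOT NULL
        DROP TABLE dbo.Events_New;

    -- CTAS from the external table; DISTINCT guards against duplicate rows.
    CREATE TABLE dbo.Events_New
    WITH (DISTRIBUTION = HASH(EventId), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT DISTINCT EventId, EventName, EventDate
    FROM ext.Events;

    -- Swap the freshly loaded table in place of the old one.
    IF OBJECT_ID('dbo.Events') IS NOT NULL
        DROP TABLE dbo.Events;

    RENAME OBJECT dbo.Events_New TO Events;
END;
```

In ADF, a Stored Procedure activity pointing at dbo.LoadEvents would then take the place of the Copy Data activity.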
Or, if you don't want to / cannot filter out data beforehand, you could also implement an "Upsert" stored procedure in your SQL DW and call this to insert your data instead of the copy activity.
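A rough sketch of what such an upsert procedure could look like, assuming the new rows have first been landed in a staging table dbo.Events_Staging (hypothetical), and written as UPDATE + INSERT since MERGE has traditionally not been available in SQL DW:

```sql
-- Sketch only: folds staged rows into the production table without
-- creating duplicates. All table and column names are hypothetical.
CREATE PROCEDURE dbo.UpsertEvents
AS
BEGIN
    -- Update rows that already exist in the target.
    UPDATE dbo.Events
    SET EventName = s.EventName,
        EventDate = s.EventDate
    FROM dbo.Events_Staging AS s
    WHERE dbo.Events.EventId = s.EventId;

    -- Insert only the rows that are not in the target yet.
    INSERT INTO dbo.Events (EventId, EventName, EventDate)
    SELECT s.EventId, s.EventName, s.EventDate
    FROM dbo.Events_Staging AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM dbo.Events AS t WHERE t.EventId = s.EventId
    );
END;
```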