Search code examples
sql-serverazureazure-data-factory

Find and load related tables / data to move from a SQL Server database in ADF


I'm facing the following problem: One of the colleages from the data science department set up a ADF pipeline which should copy data from our production database into our development database. To avoid errors with related data and writing permissions, the data from the production environment is stored in a newly created schema 'stg' within the development database. For sizing reasons, only the top 1000 entries of each table will be copied. The data should be stored at the end in the 'dbo' schema with all constraints and foreign keys. The copying from the tables in the stg schema to the dbo schema is solved with a stored procedure which only exists in the development database. So far the pipeline works. But when it comes to the copying from one schema to the other, I get errors with the foreign key relationships. This is, because it's not sure that the first 1000 entries of table A are linked to the first 1000 entries of table B. It could be, that entry nr. X of table A is linked to entry 1000 + Y in table B. My question: Is there any simple solution to solve this within the ADF pipeline or do I have to write some SQL code to fetch all related data from all related tables within our pipeline?

Here is the picture of our pipeline: Pipeline: First search for all tables, then loop through all entries and fetch first 1000 entries and store them into the destination database

Here is our Lookup code

SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE' and TABLE_SCHEMA ='dbo' and TABLE_CATALOG ='[database name]'

And with this we extract the first 1000 lines from the tables given by the lookup (within the copy activity):

@concat('select top 1000 * from  ', ' dbo.',item().TABLE_NAME)

So far we tried to enlarge the result set in the SELECT statement or adding some sorting algorithms, but both tries ended in the same error with the missing foreign keys.


Solution

  • There is no direct way to copy only related data from all tables in ADF. To do this, Below is the apprach

    • First, copy 1,000 records of the parent/base table from the production environment to the development environment database.
    • Then, create a lookup table and store the list of table names in hierarchical order so that all dependent tables are copied beforehand.
    • Take the lookup activity and use the lookup table as a dataset.

    Sample lookup table

    Order Table Related Table Lookup_Query Copy_Source_Query
    1 child_table1 parent_table1 SELECT DISTINCT fk_childtable1 FROM parent_table1 SELECT * FROM department WHERE dept_id IN (@{variables('arrayvar1')})
    2 child_table2 child_table1 SELECT DISTINCT fk_childtable2 FROM child_table1 SELECT * FROM staff WHERE dept_id IN (@{variables('arrayvar1')})
    • Use the for-each activity and check the sequential option in the settings.
    • Inside the for-each activity, take the lookup activity and use the dev environment database as the lookup activity source dataset.
    • Provide the query from the lookup table (field name: Lookup_query) to take the list of values in the dependent parent table from the lookup activity.
    • Convert the lookup activity output to an array. Refer to this StackOverflow thread for doing this.
    • Take the copy activity, and in the source settings, provide the query to take the data from the prod environment that is available in the lookup activity array.