Search code examples
jsonsql-serverazure-data-factorydata-transform

Join multiple tables in Azure Data Factory to create a structure with an embedded array field?


I am trying to set up an Azure Data Factory transformation. I have a SQL Server database with three tables: Students, StudentClasses and Classes. I would like to use Azure Data Factory to read these tables and create a JSON structure. The structure is an array of Students. A Student should have a derived field which is an array of Classes. StudentClasses is a mapping that tells which Classes were taken by a given student. It has two fields StudentId and ClassId. I think I should be able to add a derived column to Student that is an array of Classes created using the join transformation. I have gotten as far as setting up the data flows for each of the tables. However, I can't figure out how to embed the array of classes in student. I have tried adding a derived array field to Student. However, it is not clear how to populate it with the content I desire.

I can add a derived field called class to Class with the following expression:

array(classID, className)

The end result I am looking for is:

{
  "students": [
    {
      "studentID": 1,
      "studentName": "John Doe",
      "classes": [
        {
          "classID": "CS101-01",
          "className": "Introduction to Computer Science"
        },
        {
          "classID": "CS102-01",
          "className": "Mathematics for Computer Science"
        }
      ]
    },
    {
      "studentID": 2,
      "studentName": "Jane Smith",
      "classes": [
        // ... (similar structure for Jane Smith)
      ]
    }
    // ... (additional students with their respective classes)
  ]
}

Solution

  • Array(classID, className) expression in derived column transformation will create an array of ClassID, ClassName. But this will not create as proper key name in key:value format created. Also, this will not be in the nested format under each student. To solve this, you can give expression like collect(@(ClassId=ClassId,ClassName=ClassName)) in the aggregate transformation. Below is the detailed approach.

    • Join all the three tables using join transformations. First Join students and studentsClasses table based on the studentid field. Then Join the resultant table with classes table based on classid field. Then remove the duplicate columns which came as a result of join transformations using select transformation.

    enter image description here

    Data Preview of select transformation after joining all three tables:

    StudentId StudentName ClassId ClassName
    1 John Doe CS101-01 Introduction to Computer Science
    1 John Doe CS102-01 Mathematics for Computer Science
    2 Jane Smith CS101-01 Introduction to Computer Science
    2 Jane Smith CS103-01 Data Structures and Algorithms
    3 Bob Johnson CS102-01 Mathematics for Computer Science
    3 Bob Johnson CS104-01 Database Systems
    • After select transformation, take the aggregate transformation.In aggregate settings, Add the columns StudentId and StudentName in group by fields and in aggregates, create a new column named Classes and give the expression as collect(@(ClassId=ClassId,ClassName=ClassName))

    enter image description here

    Data preview of aggregate transformation:

    enter image description here

    • Take another aggregate transformation and do not add any column in group by fields. In Aggregates, create a new column named Students. Give the expression as collect(@(StudentId=StudentId,StudentName=StudentName,Classes=Classes))

    enter image description here

    Data preview of Aggragate2:

    enter image description here

    Output Json:

    {
      "Students": [
        {
          "StudentId": "1",
          "StudentName": "John Doe",
          "Classes": [
            {
              "ClassId": "CS101-01",
              "ClassName": "Introduction to Computer Science"
            },
            {
              "ClassId": "CS102-01",
              "ClassName": "Mathematics for Computer Science"
            }
          ]
        },
        {
          "StudentId": "2",
          "StudentName": "Jane Smith",
          "Classes": [
            {
              "ClassId": "CS101-01",
              "ClassName": "Introduction to Computer Science"
    ...........
            }
        ]
        }
    }