Search code examples
jsonapache-spark-sqlazure-databricks

How to convert records in Azure Databricks delta table to a nested JSON structure?


Let's say I have a delta table in Azure databricks that stores the staff details (denormalized).

sample data illustration

I wanted to export the data in the JSON format and save it as a single file on a storage location. I need help with the databricks sql query to group/construct the data in the JSON format.

Here is the sample code and desired output: Delta Table schema:

%sql
create table if not exists staff_details (
department_id int comment 'id of the department',
department_name string comment 'name of the department',
employee_id int comment 'employee id of the staff',
first_name string comment 'first name of the staff',
last_name string comment 'last name of the staff'
)
using delta

Script to populate the delta table:

%sql
insert into staff_details(department_id, department_name, employee_id, first_name, last_name)
values(1,'Dept-A',101,'Guru','Datt'), (1,'Dept-A',102,'Tom','Cruise'), (2,'Dept-B',201,'Angelina','Jolie')

Show records:

%sql
select * from staff_details order by department_id, employee_id

Desired output:

{
    "staff_details":[
        {
            "department_id":1,
            "department_name": "Dept-A",
            "staff_members": [
                {
                    "employee_id":101,
                    "first_name":"Guru",
                    "last_name":"Datt"
                },
                {
                    "employee_id":102,
                    "first_name":"Tom",
                    "last_name":"Cruise"
                }                
                
            ]
        },
        {
            "department_id":2,
            "department_name": "Dept-B",
            "staff_members": [
                {
                    "employee_id":201,
                    "first_name":"Angelina",
                    "last_name":"Jolie"
                }             
                
            ]
        }
    ]
}

I tried using the to_json() function, and also using manual string concatenation with group by etc, but none of that is working well.


Solution

  • The below spark sql query will convert the staff_details table in nested json format.

    select collect_set(to_json(struct(*))) as staff_details 
    from (SELECT department_id, department_name,
    collect_set(to_json(struct(employee_id, first_name,
    last_name))) as staff_members FROM staff_details
    group by department_id,department_name)a
    

    enter image description here

    • You can save the query result to data frame and then using pandas, save the data frame in a single file. Below is the code.
    %python
    df=spark.sql("select collect_set(to_json(struct(*))) as json_data from (select collect_set(to_json(struct(*))) as staff_details from (SELECT department_id,department_name,collect_set(to_json(struct(employee_id,first_name,last_name))) as staff_members FROM staff_details group by department_id,department_name)a)b")
    #df.show()
    import pandas
    pandas_converted=df.toPandas()
    pandas_converted.to_json("/dbfs/tmp/staff_details3.json")