Tags: python, json, dictionary, nested, metadata

Combine nested dictionaries from dot-separated keys in Python and output to JSON


Introduction and explanation of data structure

I have metadata extracted from an SEM acquisition that is split across three separate variables: acquisition_metadata, dataset_metadata, and image_metadata. Each one is built from flat dictionaries of key-value pairs, where the keys are dot-separated strings representing the hierarchical levels.

acquisition_metadata is a single dictionary of this kind.

dataset_metadata is a list of such dictionaries, where each dictionary represents the metadata for a specific dataset within the acquisition. image_metadata is a list of lists: each element of the outer list corresponds to a dataset and contains a list of dictionaries, one for each image within that dataset.

What I need to do

I need to combine these three structures into one nested dictionary in Python (and eventually a JSON file), where each dot-separated segment of a key becomes one level of the hierarchy. For example, 'acquisition.dataset.images.creationTime': '18.08.2020 17:51:07' means that I want the value '18.08.2020 17:51:07' to be stored under acquisition{dataset{images{creationTime: '18.08.2020 17:51:07'}}}.
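To make that mapping concrete, here is a minimal sketch of how a single dot-separated key could be expanded into nested dictionaries. The helper name set_nested is made up for illustration and is not part of the code shown later; it only handles plain dictionaries, not the lists under "dataset" and "images".

```python
def set_nested(target, dotted_key, value):
    """Store value in target under the path given by a dot-separated key."""
    *parents, leaf = dotted_key.split('.')
    current = target
    for name in parents:
        # Create the intermediate level on first use, then descend into it.
        current = current.setdefault(name, {})
    current[leaf] = value
    return target


example = set_nested({}, 'acquisition.dataset.images.creationTime', '18.08.2020 17:51:07')
print(example)
# {'acquisition': {'dataset': {'images': {'creationTime': '18.08.2020 17:51:07'}}}}
```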

My problem

The main issue I'm having arises when we get to the lists and the nested structure. I can't get it to dynamically build the arrays under "dataset" and "images" in the way that I want: either it repeats the "acquisition", "dataset", and/or "images" keys when it is already under them, or it places the image dictionaries outside of the array of datasets. The chatbot has gotten me close, but no matter how I describe the issue, it can't get it right. It also insists on hardcoding the level names/keys, which I don't want.

Desired end result

For reference, the combined dictionary (and output JSON) should have the following structure when created with the variables in my minimal working example (not every key/variable is shown):

metadata = {
    'acquisition': {
        'genericMetadata': {
            'program': {
                'programName': 'Auto Slice & View 4',
                'programVersion': '4.2.1.1982'
            },
            'applicationId': {
                'identifierValue': 'ASV'
            },
            'fileVersion': '1.2',
            'projectName': '20200818_AlSi13 XRM tomo2',
            'numberOfCuts': '719'
        },
        'dataset': [
            {
                'rows': '1',
                'columns': '1',
                'images': [
                    {
                        'creationTime': '18.08.2020 17:51:07',
                        'stage': {
                            'workingDistance': {
                                'value': '0.00403678'
                            }
                        }
                    },
                    {
                        'creationTime': '18.08.2020 18:09:06',
                        'stage': {
                            'workingDistance': {
                                'value': '0.00403773'
                            }
                        }
                    }
                ]
            },
            {
                'rows': '1',
                'columns': '1',
                'images': [
                    {
                        'creationTime': '18.08.2020 17:51:07',
                        'stage': {
                            'workingDistance': {
                                'value': '0.00403678'
                            }
                        }
                    },
                    {
                        'creationTime': '18.08.2020 18:09:06',
                        'stage': {
                            'workingDistance': {
                                'value': '0.00403773'
                            }
                        }
                    }
                ]
            }
        ]
    }
}

Minimal working example

Here is a minimal working example of what the dictionaries look like that I am inputting into such a function. You can copy and paste this into your IDE to recreate the inputs I'm working with.

acquisition_metadata = {
    'acquisition.genericMetadata.program.programName': 'Auto Slice & View 4',
    'acquisition.genericMetadata.program.programVersion': '4.2.1.1982',
    'acquisition.genericMetadata.applicationId.identifierValue': 'ASV',
    'acquisition.genericMetadata.fileVersion': '1.2',
    'acquisition.genericMetadata.projectName': '20200818_AlSi13 XRM tomo2',
    'acquisition.genericMetadata.numberOfCuts': '719',
}

dataset_metadata = [
    {
        'acquisition.dataset.rows': '1',
        'acquisition.dataset.columns': '1',
    },
    {
        'acquisition.dataset.rows': '1',
        'acquisition.dataset.columns': '1',
    },
]

image_metadata = [
    [
        {
            'acquisition.dataset.images.creationTime': '18.08.2020 17:51:07',
            'acquisition.dataset.images.stage.workingDistance.value': '0.00403678',
        },
        {
            'acquisition.dataset.images.creationTime': '18.08.2020 18:09:06',
            'acquisition.dataset.images.stage.workingDistance.value': '0.00403773',
        }
    ],
    [
        {
            'acquisition.dataset.images.creationTime': '18.08.2020 17:51:07',
            'acquisition.dataset.images.stage.workingDistance.value': '0.00403678',
        },
        {
            'acquisition.dataset.images.creationTime': '18.08.2020 18:09:06',
            'acquisition.dataset.images.stage.workingDistance.value': '0.00403773',
        }
    ]
]

What I have tried to do:

Here is what I have tried (with the help of our friend "Gee Pee Tee"):

import json
import os

def combine_metadata(acquisition_metadata, dataset_metadata, image_metadata):
    metadata = {}
    
    # Combine acquisition metadata
    for key, value in acquisition_metadata.items():
        nested_keys = key.split('.')
        current_dict = metadata
        
        for nested_key in nested_keys[:-1]:
            if nested_key not in current_dict:
                current_dict[nested_key] = {}
            current_dict = current_dict[nested_key]
        
        current_dict[nested_keys[-1]] = value
    
    # Combine dataset metadata
    metadata['acquisition']['dataset'] = []
    for dataset in dataset_metadata:
        dataset_dict = {}
        for key, value in dataset.items():
            nested_keys = key.split('.')
            current_dict = dataset_dict
            
            for nested_key in nested_keys[:-1]:
                if nested_key not in current_dict:
                    current_dict[nested_key] = {}
                current_dict = current_dict[nested_key]
            
            current_dict[nested_keys[-1]] = value
        
        metadata['acquisition']['dataset'].append(dataset_dict)
    
    # Combine image metadata
    for i, images in enumerate(image_metadata):
        metadata['acquisition']['dataset'][i]['images'] = []
        for image in images:
            image_dict = {}
            for key, value in image.items():
                nested_keys = key.split('.')
                current_dict = image_dict
                
                for nested_key in nested_keys[:-1]:
                    if nested_key not in current_dict:
                        current_dict[nested_key] = {}
                    current_dict = current_dict[nested_key]
                
                current_dict[nested_keys[-1]] = value
            
            metadata['acquisition']['dataset'][i]['images'].append(image_dict)
    
    return metadata

def save_metadata_as_json(metadata, save_path):
    filename = os.path.join(save_path, "combined.json")
    with open(filename, 'w') as file:
        json.dump(metadata, file, indent=4)
    print(f"Metadata saved as {filename}")

But it produces this output:

{
    "acquisition": {
        "genericMetadata": {
            "program": {
                "programName": "Auto Slice & View 4",
                "programVersion": "4.2.1.1982"
            },
            "applicationId": {
                "identifierValue": "ASV"
            },
            "fileVersion": "1.2",
            "projectName": "20200818_AlSi13 XRM tomo2",
            "numberOfCuts": "719"
        },
        "dataset": [
            {
                "acquisition": {
                    "dataset": {
                        "rows": "1",
                        "columns": "1"
                    }
                },
                "images": [
                    {
                        "acquisition": {
                            "dataset": {
                                "images": {
                                    "creationTime": "18.08.2020 17:51:07",
                                    "stage": {
                                        "workingDistance": {
                                            "value": "0.00403678"
                                        }
                                    }
                                }
                            }
                        }
                    },
                    {
                        "acquisition": {
                            "dataset": {
                                "images": {
                                    "creationTime": "18.08.2020 18:09:06",
                                    "stage": {
                                        "workingDistance": {
                                            "value": "0.00403773"
                                        }
                                    }
                                }
                            }
                        }
                    }
                ]
            },
            {
                "acquisition": {
                    "dataset": {
                        "rows": "1",
                        "columns": "1"
                    }
                },
                "images": [
                    {
                        "acquisition": {
                            "dataset": {
                                "images": {
                                    "creationTime": "18.08.2020 17:51:07",
                                    "stage": {
                                        "workingDistance": {
                                            "value": "0.00403678"
                                        }
                                    }
                                }
                            }
                        }
                    },
                    {
                        "acquisition": {
                            "dataset": {
                                "images": {
                                    "creationTime": "18.08.2020 18:09:06",
                                    "stage": {
                                        "workingDistance": {
                                            "value": "0.00403773"
                                        }
                                    }
                                }
                            }
                        }
                    }
                ]
            }
        ]
    }
}

where you can see the redundant level names I was talking about...

Conclusion

In short, I need a function that takes the above dictionaries as input and creates a nested dictionary structured as shown under "Desired end result". I am eventually outputting this combined dictionary as a JSON file, so if it's easier to go directly to the JSON output, then I'll take that as well.
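On the JSON part: the desired structure contains only dicts, lists, and strings, which map one-to-one onto JSON objects, arrays, and strings, so building the Python dictionary first and serializing it afterwards should not lose anything. A minimal sketch, using a hand-written fragment of the desired structure purely for illustration:

```python
import json

# Hand-written fragment of the desired structure, for illustration only.
metadata = {
    'acquisition': {
        'dataset': [
            {'rows': '1', 'images': [{'creationTime': '18.08.2020 17:51:07'}]},
        ]
    }
}

# dicts become JSON objects, lists become JSON arrays, strings stay strings.
text = json.dumps(metadata, indent=4)

# Parsing the text again gives back an equal dictionary.
round_trip = json.loads(text)
print(round_trip == metadata)  # True
```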


Solution

  • You almost had it. Note the added nested_keys.remove(...) calls, which strip the level names that are already implied by the position in the list:

def combine_metadata(acquisition_metadata, dataset_metadata, image_metadata):
    metadata = {}

    # Combine acquisition metadata
    for key, value in acquisition_metadata.items():
        nested_keys = key.split('.')
        current_dict = metadata

        for nested_key in nested_keys[:-1]:
            if nested_key not in current_dict:
                current_dict[nested_key] = {}
            current_dict = current_dict[nested_key]

        current_dict[nested_keys[-1]] = value

    # Combine dataset metadata
    metadata['acquisition']['dataset'] = []
    for dataset in dataset_metadata:
        dataset_dict = {}
        for key, value in dataset.items():
            nested_keys = key.split('.')
            # Drop the levels already implied by metadata['acquisition']['dataset'][i]
            # (list.remove deletes the first occurrence of each name)
            nested_keys.remove('acquisition')
            nested_keys.remove('dataset')
            current_dict = dataset_dict

            for nested_key in nested_keys[:-1]:
                if nested_key not in current_dict:
                    current_dict[nested_key] = {}
                current_dict = current_dict[nested_key]

            current_dict[nested_keys[-1]] = value

        metadata['acquisition']['dataset'].append(dataset_dict)

    # Combine image metadata
    for i, images in enumerate(image_metadata):
        metadata['acquisition']['dataset'][i]['images'] = []
        for image in images:
            image_dict = {}
            for key, value in image.items():
                nested_keys = key.split('.')
                # Drop the levels already implied by ...['dataset'][i]['images'][j]
                nested_keys.remove('acquisition')
                nested_keys.remove('dataset')
                nested_keys.remove('images')
                current_dict = image_dict

                for nested_key in nested_keys[:-1]:
                    if nested_key not in current_dict:
                        current_dict[nested_key] = {}
                    current_dict = current_dict[nested_key]

                current_dict[nested_keys[-1]] = value

            metadata['acquisition']['dataset'][i]['images'].append(image_dict)

    return metadata
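The question also asked to avoid hardcoding the level names. One possible way to do that, sketched here as an alternative and not as part of the answer above, is to infer the list levels from the key segments that all dictionaries in a list share. The helpers nest and common_prefix are hypothetical names. The sketch assumes that both lists are non-empty, that at least one key sits directly under each list level (as 'rows' and 'creationTime' do here), and that the image level is exactly one segment below the dataset level.

```python
def nest(flat, prefix=''):
    """Expand dot-separated keys into nested dicts, dropping a leading prefix."""
    nested = {}
    for key, value in flat.items():
        if prefix and key.startswith(prefix + '.'):
            key = key[len(prefix) + 1:]
        *parents, leaf = key.split('.')
        current = nested
        for name in parents:
            current = current.setdefault(name, {})
        current[leaf] = value
    return nested


def common_prefix(flat_dicts):
    """Leading key segments (leaf excluded) shared by every key in every dict."""
    paths = [key.split('.')[:-1] for flat in flat_dicts for key in flat]
    shared = []
    for segments in zip(*paths):
        if len(set(segments)) != 1:
            break
        shared.append(segments[0])
    return shared


def combine_metadata(acquisition_metadata, dataset_metadata, image_metadata):
    metadata = nest(acquisition_metadata)

    # Inferred from the data: ['acquisition', 'dataset'] and [..., 'images'].
    dataset_path = common_prefix(dataset_metadata)
    image_path = common_prefix([img for imgs in image_metadata for img in imgs])

    # Walk to the parent of the dataset list and attach an empty list there.
    parent = metadata
    for name in dataset_path[:-1]:
        parent = parent.setdefault(name, {})
    datasets = parent[dataset_path[-1]] = []

    for flat_dataset, flat_images in zip(dataset_metadata, image_metadata):
        dataset = nest(flat_dataset, '.'.join(dataset_path))
        # Assumption: the image list hangs directly under each dataset.
        dataset[image_path[-1]] = [nest(img, '.'.join(image_path)) for img in flat_images]
        datasets.append(dataset)
    return metadata


# Reduced inputs in the same shape as the minimal working example.
acq = {'acquisition.genericMetadata.fileVersion': '1.2'}
dsets = [{'acquisition.dataset.rows': '1'}, {'acquisition.dataset.rows': '1'}]
imgs = [
    [{'acquisition.dataset.images.creationTime': '18.08.2020 17:51:07',
      'acquisition.dataset.images.stage.workingDistance.value': '0.00403678'}],
    [{'acquisition.dataset.images.creationTime': '18.08.2020 18:09:06',
      'acquisition.dataset.images.stage.workingDistance.value': '0.00403773'}],
]
combined = combine_metadata(acq, dsets, imgs)
```

If every key in a list happened to share a deeper level (for example, if all image keys were under 'stage'), the inferred prefix would be too long, so passing the two prefixes in explicitly may be the safer design for real data.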