
How to print all duplicate keys, including full paths and optionally values, for nested JSON in Python?


External libraries are allowed but less preferred.

Example input:

data.json with content:

{
    "name": "John",
    "age": 30,
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "street": "321 Wall St"
    },
    "contacts": [
        {
            "type": "email",
            "value": "john@example.com"
        },
        {
            "type": "phone",
            "value": "555-1234"
        },
        {
            "type": "email",
            "value": "johndoe@example.com"
        }
    ],
    "age": 35
}

Example expected output:

Duplicate keys found:
  age (30, 35)
  address -> street ("123 Main St", "321 Wall St")

Calling json.load or json.loads as-is returns a standard Python dictionary, which silently drops duplicate keys, so I think we need a way to "stream" the JSON as it loads, in some depth-first-search / visitor-pattern way.
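For illustration, here is a minimal sketch of that behaviour. The names `doc` and `parsed` are made up for this example, and the JSON is a shortened version of data.json:

```python
import json

# Shortened version of data.json; both "age" and "street" are duplicated.
doc = '{"age": 30, "address": {"street": "a", "street": "b"}, "age": 35}'

# With no hook, each repeated key is collapsed and the last value wins.
parsed = json.loads(doc)
print(parsed)  # {'age': 35, 'address': {'street': 'b'}}
```

No error or warning is raised, so the duplicates cannot be recovered from the resulting dictionary.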

I've also tried something similar to what was suggested here: https://stackoverflow.com/a/14902564/8878330 (quoted below)

def dict_raise_on_duplicates(ordered_pairs):
    """Reject duplicate keys."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            raise ValueError("duplicate key: %r" % (k,))
        else:
            d[k] = v
    return d

The only change I made was that, instead of raising, I append the duplicate key to a list so I can print all duplicate keys at the end.

The problem is that I don't see a simple way to get the "full path" of each duplicate key.
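A rough sketch of that modified hook, to make the limitation concrete. The list name `duplicates` and the function name `dict_collect_duplicates` are hypothetical, and keeping the last value for a repeated key is an assumption:

```python
import json

duplicates = []  # hypothetical module-level list of duplicate key names


def dict_collect_duplicates(ordered_pairs):
    """Record duplicate keys instead of raising."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            duplicates.append(k)
        d[k] = v  # assumption: the last value wins, as with plain json.loads
    return d


doc = '{"age": 30, "address": {"street": "a", "street": "b"}, "age": 35}'
json.loads(doc, object_pairs_hook=dict_collect_duplicates)
print(duplicates)  # ['street', 'age']
```

The hook is called once per JSON object, innermost objects first, and it only receives that object's own key/value pairs. That is why the list holds bare key names such as 'street' with no indication that the key lives under "address".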


Solution

  • We use the object_pairs_hook argument of json.loads to inspect all key/value pairs within the same JSON object and check for duplicate keys. When a duplicate key is found, we rename it by prepending '#duplicate_key#' (we assume that no original key name begins with those characters). We then recursively walk the parsed object to compute the full path of each dictionary key, and print the paths and values of the duplicates we found.

    import json
    
    DUPLICATE_MARKER = '#duplicate_key#'
    DUPLICATE_MARKER_LENGTH = len(DUPLICATE_MARKER)
    
    s = """{
        "name": "John",
        "age": 30,
        "address": {
            "street": "123 Main St",
            "city": "New York",
            "street": "321 Wall St"
        },
        "contacts": [
            {
                "type": "email",
                "value": "john@example.com"
            },
            {
                "type": "phone",
                "value": "555-1234"
            },
            {
                "type": "email",
                "value": "johndoe@example.com"
            }
        ],
        "age": 35
    }"""
    
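    def my_hook(initial_pairs):
        """Build a dict, renaming each repeated key with DUPLICATE_MARKER."""
        seen = set()
        pairs = []
        for k, v in initial_pairs:
            if k in seen:
                # Rename the key so the duplicate survives dict():
                k = DUPLICATE_MARKER + k
            else:
                seen.add(k)
            pairs.append((k, v))
        # Note: a key repeated three or more times keeps only its last
        # marked value, because the marked keys collide with each other.
        return dict(pairs)
    
    def get_duplicates_path(o, path):
        """Recursively walk the parsed object and print each marked key."""
        if isinstance(o, list):
            for i, v in enumerate(o):
                get_duplicates_path(v, f'{path}[{i}]')
        elif isinstance(o, dict):
            for k, v in o.items():
                if k.startswith(DUPLICATE_MARKER):
                    k = k[DUPLICATE_MARKER_LENGTH:]
                    print(f'duplicate key at {path}[{k!r}] with value {v!r}')
                # Recurse in both cases so that duplicates nested inside a
                # duplicate's value are reported too.
                get_duplicates_path(v, f'{path}[{k!r}]')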
    
    print(s)
    obj = json.loads(s, object_pairs_hook=my_hook)
    get_duplicates_path(obj, 'obj')
    
    print()
    
    # Another test:
    
    s = """[
       {
           "x": [{"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "a": 3}]
       },
       {
           "y": "z"
       }
    ]"""
    
    print(s)
    obj = json.loads(s, object_pairs_hook=my_hook)
    get_duplicates_path(obj, 'obj')
    

    Prints:

    {
        "name": "John",
        "age": 30,
        "address": {
            "street": "123 Main St",
            "city": "New York",
            "street": "321 Wall St"
        },
        "contacts": [
            {
                "type": "email",
                "value": "john@example.com"
            },
            {
                "type": "phone",
                "value": "555-1234"
            },
            {
                "type": "email",
                "value": "johndoe@example.com"
            }
        ],
        "age": 35
    }
    duplicate key at obj['address']['street'] with value '321 Wall St'
    duplicate key at obj['age'] with value 35
    
    [
       {
           "x": [{"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "a": 3}]
       },
       {
           "y": "z"
       }
    ]
    duplicate key at obj[0]['x'][1]['a'] with value 3
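
The output above reports only the later occurrence of each duplicate. If the grouped format from the question is wanted (all values of a key on one line, path segments joined by " -> "), one possible variation is to keep every pair instead of renaming keys. This is only a sketch under assumptions: the names `Pairs`, `find_duplicates` and `report_duplicates` are hypothetical, array positions are rendered as `[i]` path segments, and values are formatted with json.dumps.

```python
import json
from collections import defaultdict


class Pairs(list):
    """A list of (key, value) pairs decoded from one JSON object."""


def find_duplicates(node, path=()):
    """Yield (path, values) for each key that occurs more than once."""
    if isinstance(node, Pairs):  # check first: Pairs is a list subclass
        grouped = defaultdict(list)
        for key, value in node:
            grouped[key].append(value)
        for key, values in grouped.items():
            if len(values) > 1:
                yield path + (key,), values
            for value in values:  # descend into every occurrence
                yield from find_duplicates(value, path + (key,))
    elif isinstance(node, list):  # a JSON array
        for index, item in enumerate(node):
            yield from find_duplicates(item, path + (f'[{index}]',))


def report_duplicates(text):
    """Return one formatted line per duplicated key in the JSON text."""
    tree = json.loads(text, object_pairs_hook=Pairs)
    return [
        f"{' -> '.join(path)} ({', '.join(json.dumps(v) for v in values)})"
        for path, values in find_duplicates(tree)
    ]


sample = ('{"age": 30, "address": {"street": "123 Main St", '
          '"street": "321 Wall St"}, "age": 35}')
print("Duplicate keys found:")
for line in report_duplicates(sample):
    print("  " + line)
```

Because the hook returns the raw pairs rather than a dict, nothing is lost when a key appears three or more times, and no marker string has to be reserved. The trade-off is that the parsed result is no longer a normal dictionary, so the text has to be parsed again with a plain json.loads if the data itself is needed.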