External libraries are allowed but less preferred.
Example input:
data.json
with content:
{
"name": "John",
"age": 30,
"address": {
"street": "123 Main St",
"city": "New York",
"street": "321 Wall St"
},
"contacts": [
{
"type": "email",
"value": "john@example.com"
},
{
"type": "phone",
"value": "555-1234"
},
{
"type": "email",
"value": "johndoe@example.com"
}
],
"age": 35
}
Example expected output:
Duplicate keys found:
age (30, 35)
address -> street ("123 Main St", "321 Wall St")
Using json.load / json.loads as is returns a standard Python dictionary, which silently drops duplicate keys, so I think we need a way to "stream" the JSON as it loads, in some depth-first-search / visitor-pattern way.
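To make the problem concrete, here is a minimal sketch (the input string is a made-up, shortened version of data.json) showing that the default parser keeps only the last value of a repeated key:

```python
import json

# Shortened, hypothetical input with one repeated key.
text = '{"age": 30, "age": 35}'

# With no hook, the decoder builds a plain dict, so the later
# value overwrites the earlier one without any warning.
parsed = json.loads(text)
print(parsed)  # {'age': 35}
```

The first value (30) is gone by the time the dictionary is returned, which is why the duplicates have to be caught while parsing.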
I've also tried something similar to what was suggested here: https://stackoverflow.com/a/14902564/8878330 (quoted below)
def dict_raise_on_duplicates(ordered_pairs):
    """Reject duplicate keys."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            raise ValueError("duplicate key: %r" % (k,))
        else:
            d[k] = v
    return d
The only change I made was that, instead of raising, I append the duplicate key to a list so I can print the list of duplicate keys at the end.
The problem is that I don't see a simple way to get the "full path" of the duplicate keys.
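For illustration, the modified hook described above might look roughly like this. The names `duplicates` and `dict_collect_duplicates` are assumptions made for this sketch, not the exact code that was tried:

```python
import json

duplicates = []  # hypothetical module-level list of repeated key names


def dict_collect_duplicates(ordered_pairs):
    """Record duplicate keys instead of raising."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            duplicates.append(k)
        d[k] = v
    return d


text = '{"age": 30, "address": {"street": "a", "street": "b"}, "age": 35}'
json.loads(text, object_pairs_hook=dict_collect_duplicates)
print(duplicates)  # ['street', 'age']
```

The hook is called once per JSON object, innermost first, and only receives that object's own pairs. It therefore knows the key name `street` but has no idea that the object sits under `address`, which is exactly the missing "full path".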
We use the object_pairs_hook argument of json.loads to inspect all key/value pairs within the same dictionary and check for duplicate keys. When a duplicate key is found, we modify the key name by prepending '#duplicate_key#' to it (we assume that no original key name begins with those characters). Next, we recursively walk the object that was just parsed from the JSON to compute the full paths of dictionary keys and print out the paths and values for the duplicates we discovered.
import json
DUPLICATE_MARKER = '#duplicate_key#'
DUPLICATE_MARKER_LENGTH = len(DUPLICATE_MARKER)
s = """{
"name": "John",
"age": 30,
"address": {
"street": "123 Main St",
"city": "New York",
"street": "321 Wall St"
},
"contacts": [
{
"type": "email",
"value": "john@example.com"
},
{
"type": "phone",
"value": "555-1234"
},
{
"type": "email",
"value": "johndoe@example.com"
}
],
"age": 35
}"""
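def my_hook(initial_pairs):
    """Build a dict from one JSON object's pairs, renaming repeated keys."""
    s = set()  # key names already seen in this object
    pairs = []
    for pair in initial_pairs:
        k, v = pair
        if k in s:
            # Replace key name:
            k = DUPLICATE_MARKER + k
            pairs.append((k, v))
        else:
            s.add(k)
            pairs.append(pair)
    return dict(pairs)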
def get_duplicates_path(o, path):
    """Walk the parsed object and print the full path of each marked key."""
    if isinstance(o, list):
        for i, v in enumerate(o):
            get_duplicates_path(v, f'{path}[{i}]')
    elif isinstance(o, dict):
        for k, v in o.items():
            if k[:DUPLICATE_MARKER_LENGTH] == DUPLICATE_MARKER:
                # Strip the marker to recover the original key name.
                print(f'duplicate key at {path}[{repr(k[DUPLICATE_MARKER_LENGTH:])}] with value {repr(v)}')
            else:
                get_duplicates_path(v, f'{path}[{repr(k)}]')
print(s)
obj = json.loads(s, object_pairs_hook=my_hook)
get_duplicates_path(obj, 'obj')
print()
# Another test:
s = """[
{
"x": [{"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "a": 3}]
},
{
"y": "z"
}
]"""
print(s)
obj = json.loads(s, object_pairs_hook=my_hook)
get_duplicates_path(obj, 'obj')
Prints:
{
"name": "John",
"age": 30,
"address": {
"street": "123 Main St",
"city": "New York",
"street": "321 Wall St"
},
"contacts": [
{
"type": "email",
"value": "john@example.com"
},
{
"type": "phone",
"value": "555-1234"
},
{
"type": "email",
"value": "johndoe@example.com"
}
],
"age": 35
}
duplicate key at obj['address']['street'] with value '321 Wall St'
duplicate key at obj['age'] with value 35
[
{
"x": [{"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "a": 3}]
},
{
"y": "z"
}
]
duplicate key at obj[0]['x'][1]['a'] with value 3
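If you want output closer to the format asked for in the question, with every value of a repeated key listed, one possible variation is to keep all values per key in a dict subclass instead of renaming keys. This is only a sketch: `DupDict`, `find_duplicates` and `format_duplicate` are names invented here, and the input is a shortened version of the example. It also sidesteps a limitation of the marker approach, where a key occurring three or more times would produce the same marked name for each repeat, so only the last repeat would survive.

```python
import json


class DupDict(dict):
    """A dict that also remembers every value seen for each key."""

    def __init__(self, pairs):
        super().__init__()
        self.all_values = {}  # key -> list of every value, in document order
        for key, value in pairs:
            self.all_values.setdefault(key, []).append(value)
            self[key] = value  # last value wins, as with plain json.loads


def find_duplicates(obj, path=()):
    """Yield (path, values) for every key that occurs more than once."""
    if isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_duplicates(item, path + (index,))
    elif isinstance(obj, DupDict):
        for key, values in obj.all_values.items():
            if len(values) > 1:
                yield path + (key,), values
            # Recurse into every value, including overwritten ones.
            for value in values:
                yield from find_duplicates(value, path + (key,))


def format_duplicate(path, values):
    """Render one duplicate as 'a -> b (v1, v2)'."""
    shown = ", ".join(json.dumps(v) for v in values)
    return " -> ".join(str(p) for p in path) + f" ({shown})"


text = """{
"age": 30,
"address": {"street": "123 Main St", "city": "New York", "street": "321 Wall St"},
"age": 35
}"""

obj = json.loads(text, object_pairs_hook=DupDict)
report = [format_duplicate(p, v) for p, v in find_duplicates(obj)]
print("Duplicate keys found:")
print("\n".join(report))
```

Because `DupDict` is still a `dict`, the parsed object can be used as usual afterwards. List indices appear as plain numbers in the path, for example `contacts -> 1 -> type`.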