Tags: python, recursion, tree, cerberus

Validate a recursive data structure (e.g. tree) using Python Cerberus (v1.3.5)


What is the right way to model a recursive data structure's schema in Cerberus?

Attempt #1:

from cerberus import Validator, schema_registry
schema_registry.add("leaf", {"value": {"type": "integer", "required": True}})
schema_registry.add("tree", {"type": "dict", "anyof_schema": ["leaf", "tree"]})
v = Validator(schema = {"root": {"type": "dict", "schema": "tree"}})

Error:

cerberus.schema.SchemaError: {'root': [{
    'schema': [
        'no definitions validate', {
            'anyof definition 0': [{
                'anyof_schema': ['must be of dict type'], 
                'type': ['null value not allowed'],
            }],
            'anyof definition 1': [
                'Rules set definition tree not found.'
            ],
        },
    ]},
]}

Attempt #2:

The error above suggests that a rules set definition for tree is needed:

from cerberus import Validator, schema_registry, rules_set_registry
schema_registry.add("leaf", {"value": {"type": "integer", "required": True}})
rules_set_registry.add("tree", {"type": "dict", "anyof_schema": ["leaf", "tree"]})
v = Validator(schema = {"root": {"type": "dict", "schema": "tree"}})

v.validate({"root": {"value": 1}})
v.errors
v.validate({"root": {"a": {"value": 1}}})
v.errors
v.validate({"root": {"a": {"b": {"c": {"value": 1}}}}})
v.errors

Output:

False
{'root': ['must be of dict type']}

for all 3 examples.

Expected behaviour

Ideally, I would like all the below documents to pass validation:

v = Validator(schema = {"root": {"type": "dict", "schema": "tree"}})
assert v.validate({"root": {"value": 1}}), v.errors
assert v.validate({"root": {"a": {"value": 1}}}), v.errors
assert v.validate({"root": {"a": {"b": {"c": {"value": 1}}}}}), v.errors


Solution

  • WARNING

    The following is not a complete solution.
    If someone has a fully working solution with Cerberus, please share it and I will happily mark your answer as the solution.

    Additional constraint from my actual problem

    The tree's leaves contain some keys that must match another part of the document I am validating. For this reason, I have an additional is_in validation method in my custom Validator. However, I couldn't find a good way to have a child validator for the leaves, while still keeping a reference to another part of the document at the root.

    Observation

    I have now spent more time "fighting" Cerberus than it would have taken me to implement a custom input validation function, so I may try that instead for now, or try jsonschema. (EDIT: see attempt #4 below.)
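
For comparison, a custom input validation function for the reduced problem could look roughly like the sketch below. It is plain Python, not Cerberus, and it rests on assumptions that are not part of the question: any dict with a "value" key is treated as a leaf, extra keys on a leaf are ignored, and empty subtrees are accepted. The names is_leaf and find_tree_errors are made up for this example.

```python
from typing import Any, List


def is_leaf(node: Any) -> bool:
    """A leaf is a dict holding an integer (not a bool) under the key "value"."""
    value = node.get("value") if isinstance(node, dict) else None
    return isinstance(value, int) and not isinstance(value, bool)


def find_tree_errors(node: Any, path: str = "root") -> List[str]:
    """Return one message per problem found; an empty list means valid."""
    if not isinstance(node, dict):
        return [f"{path}: must be of dict type"]
    if "value" in node:
        # Assumption: a "value" key marks a leaf, so recursion stops here.
        return [] if is_leaf(node) else [f"{path}.value: must be of integer type"]
    errors: List[str] = []
    for key, child in node.items():
        errors.extend(find_tree_errors(child, f"{path}.{key}"))
    return errors


# The three documents from the question, passed in as document["root"].
for document in (
    {"root": {"value": 1}},
    {"root": {"a": {"value": 1}}},
    {"root": {"a": {"b": {"c": {"value": 1}}}}},
):
    assert find_tree_errors(document["root"]) == []
```

Returning a list of messages instead of a bool keeps the output close to what v.errors would provide.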

    Attempt #3: cerberus custom validator

    Hopefully, the below logic can still be useful to someone.

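    from cerberus import Validator
    from typing import Any

    # NOTE: KEYS (the keys that identify a leaf) and SCHEMA (the leaf's schema)
    # are not defined in this snippet; they must exist at module level.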
    
    
    class ManifestValidator(Validator):
        def _validate_type_tree(self: Validator, value: Any) -> bool:
            if not isinstance(value, dict):
                return False
            for v in value.values():
                if isinstance(v, dict):
                    if all(key in v for key in KEYS):
                        schema = self._resolve_schema(SCHEMA)
                        validator = self._get_child_validator(
                            document_crumb=v,
                            schema_crumb=(v, "schema"),
                            root_document=self.root_document,
                            root_schema=self.root_schema,
                            schema=schema,
                        )
                        if not validator(v, update=self.update) or validator._errors:
                            self._error(validator._errors)
                            return False
                    elif not self._validate_type_tree(v):
                        return False
                else:
                    return False
            return True
    
        def _validate_is_in(self: Validator, path: str, field: str, value: str) -> bool:
            """{'type': 'string'}"""
            document = self.root_document
            for element in path.split("."):
                if element not in document:
                    self._error(field, f"{path} does not exist in {document}")
                    return False
                document = document[element]
            if not isinstance(document, list):
                self._error(
                    field,
                    f"{path} does not point to a list but to {document} of type {type(document)}",
                )
                return False
            if value not in document:
                self._error(field, f"{value} is not present in {document} at {path}.")
                return False
            return True
    

    Attempt #4: jsonschema + custom validation logic

    from jsonschema import validate
    
    
    SCHEMA = {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type" : "object",
        "properties" : {
            "root": {
                "oneOf": [
                    {"$ref": "#/$defs/tree",}, 
                    {"$ref": "#/$defs/leaf",},
                ],
            },
        },
        "required": [
            "root",
        ],
        "$defs": {
            "tree": {
                "type": "object",
                "patternProperties": {
                    "^[a-z]+([_-][a-z]+)*$": {
                        "oneOf": [
                            {"$ref": "#/$defs/tree",}, 
                            {"$ref": "#/$defs/leaf",},
                        ],
                    },
                },
                "additionalProperties": False,
            },
            "leaf": {
                "type": "object",
                "properties": {
                    # In reality, the leaf is a more complex object, but as a reduction of my problem:
                    "value": {
                        "type": "number",
                    },
                },
                "required": [
                    "value",
                ],
            },
        },
    }
    
    
    TREES = [
        {"root": {"value": 1}},
        {"root": {"a": {"value": 1}}},
        {"root": {"a": {"b": {"c": {"value": 1}}}}},
        {"root": {"a-subtree": {"b-subtree": {"c-subtree": {"value": 1}}}}},
    ]
    
    
    for tree in TREES:
        validate(tree, SCHEMA)
    

    For my additional constraint (is_in), JSON pointers, relative JSON pointers, or $data seem like they could be useful in simpler cases. For what I needed, though, I decided to implement custom validation logic that runs after the jsonschema validation, which was a good first step to prove that the document is well-formed.
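
A rough sketch of what such a post-validation step might look like, assuming the document has already passed the jsonschema validation (so every non-leaf node is a dict). The leaf key owner and the top-level list owners are hypothetical stand-ins for the real fields, and the helper names are invented for the example.

```python
from typing import Any, Iterator, List, Tuple


def iter_leaves(node: dict, path: str = "root") -> Iterator[Tuple[str, dict]]:
    """Yield (path, leaf) pairs; a leaf is any dict that has a "value" key."""
    if "value" in node:
        yield path, node
        return
    for key, child in node.items():
        yield from iter_leaves(child, f"{path}.{key}")


def resolve(document: dict, dotted_path: str) -> Any:
    """Follow a dotted path such as "a.b" from the document root."""
    current = document
    for element in dotted_path.split("."):
        current = current[element]  # raises KeyError if the path is missing
    return current


def check_is_in(document: dict, leaf_key: str, list_path: str) -> List[str]:
    """Check that each leaf's leaf_key value appears in the list at list_path."""
    allowed = resolve(document, list_path)
    errors = []
    for path, leaf in iter_leaves(document["root"]):
        if leaf_key in leaf and leaf[leaf_key] not in allowed:
            errors.append(
                f"{path}.{leaf_key}: {leaf[leaf_key]!r} is not present at {list_path}"
            )
    return errors


document = {
    "owners": ["alice", "bob"],
    "root": {
        "a": {"value": 1, "owner": "alice"},
        "b": {"c": {"value": 2, "owner": "carol"}},
    },
}
errors = check_is_in(document, "owner", "owners")
```

Keeping the cross-reference check separate from the structural schema means the walk can assume a well-formed tree and stay short.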

    Resources: