
Type-Hinting JSON Object Rigorously


I have a function in my program that takes parsed JSON as one of its parameters, and I wanted to annotate the function signature accordingly. I recalled that the loads() function from the built-in json module accepts a str, bytes, or bytearray (parameters which aren't themselves annotated), but the return type of the module's function isn't type-hinted either. As such, I'm unsure of how to go about type-hinting my parameters.

The json.loads() function in question:

def loads(s, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    # function code is here
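
For illustration, the value returned by json.loads() varies entirely with the input text, which is presumably why it is left unannotated:

```python
import json

# The return value of loads() depends on the JSON text it parses:
assert json.loads("1") == 1
assert json.loads("true") is True
assert json.loads('"a"') == "a"
assert json.loads('{"k": [1, null]}') == {"k": [1, None]}
```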

I felt that simply using typing.Any was a bit too vague for the purpose of my function, as it demands a specific structure to execute properly (I'm literally pulling in a dictionary within a dictionary within a list within a dictionary).

My function in question:

def get_order_list(json_response: <type desired>, *, time_r: int = 60, buyers: bool = False) -> list[int]:
    # code

I then set about trying to rigorously annotate the one parameter as such:

def get_order_list(json_response: list[dict[str, None | str | bool | int | dict[str, str | int | bool]]], *, time_r: int = 60, buyers: bool = False) -> list[int]:
    # code

But this just looks overcomplicated and ugly. So now I'm at a crossroads: should I settle for typing.Any, or should I use my convoluted mess of type hints?


One more thing: would it be wise to document the structure that the JSON input should have? If so, should that go in the docstring of the function, the docstring of the module, or in a separate file?


Solution

  • Should I settle for typing.Any, or should I use my convoluted mess of type hints?

    I think you should avoid Any as much as possible (see these recommendations as well). It is too often used as a cop-out to avoid actually thinking about your data types. I also don't think your type hint is a convoluted mess; it is just formatted and structured sub-optimally, in my opinion.

    At the minimum, you could use line breaks and indentation for improved readability:

    def get_order_list(
        json_response: list[
            dict[
                str,
                None | str | bool | int | dict[
                    str,
                    str | int | bool
                ]
            ]
        ],
        *,
        time_r: int = 60,
        buyers: bool = False,
    ) -> list[int]:
        ...
    

    Aside from that, you should definitely remove redundancies in your type unions. An example is using a union S | T when S is a subtype of T; in that case, just use T.

    bool is a subtype of int (check it with issubclass), so you can drop the bool from the unions:

    def get_order_list(
        json_response: list[
            dict[
                str,
                None | str | int | dict[str, str | int]
            ]
        ],
        *,
        time_r: int = 60,
        buyers: bool = False,
    ) -> list[int]:
        ...
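
    The subtype relationship can be verified directly at the prompt:

```python
# bool is a subclass of int, so every bool is also an int;
# a union like `int | bool` therefore collapses to just `int`.
assert issubclass(bool, int)
assert isinstance(True, int)
```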
    

    Then you might want to use (explicit) type aliases to abstract away some of the details and reduce repetition. I would also suggest always placing None last in a type union, but that may just be my preference:

    from typing import TypeAlias
    
    StrInt: TypeAlias = str | int
    
    
    def get_order_list(
        json_response: list[
            dict[
                str,
                StrInt | dict[str, StrInt] | None
            ]
        ],
        *,
        time_r: int = 60,
        buyers: bool = False,
    ) -> list[int]:
        ...
    

    At best, those nested types get actual semantic names/aliases. You have not provided enough information here, so the best I can do is use generic names like these:

    from typing import TypeAlias
    
    StrInt: TypeAlias = str | int
    ValueType: TypeAlias = StrInt | dict[str, StrInt] | None
    ItemType: TypeAlias = dict[str, ValueType]
    
    
    def get_order_list(
        json_response: list[ItemType],
        *,
        time_r: int = 60,
        buyers: bool = False,
    ) -> list[int]:
        ...
    

    But if you can name those appropriately, not only will the actual function signature be more concise, but it will also be much clearer to anyone reading the code what those types actually mean. Which leads into your second question.


    would it be wise to document the structure that the JSON input should have?

    The better you name things in your code, the fewer comments and documentation you will need. Documentation is obviously still necessary, but you will be able to leverage the names you already defined to make it much more concise and precise.

    In this case, you will be able to refer to some of your type aliases directly, writing things like this:

    json_response should be a list of "items"; an item is a dictionary with string keys, where each value is a "value"; a value can be a string, an integer, a dictionary mapping strings to strings or integers, or None.

    I am again deliberately using generic terms like "item" and "value", for which you can substitute something with actual meaning in your specific context.

    Lastly, I would advise against calling that parameter json_response in the first place, because that is a confusing and frankly incorrect name for what the object will contain. JSON, as you know, is just a specific text format. It would make sense to name the parameter that way if you expected the argument to be a str of actual JSON.

    But you don't expect that. Your function expects built-in Python types that you can lay out very precisely (as above). And your function is (and should be) agnostic about where those objects come from: a parsed JSON file or anywhere else. It would work the same regardless of where that list of dictionaries came from. Therefore the "JSON" in the parameter name is simply misleading and unnecessary. "Response" may be sensible in terms of meaning; again, I don't know the context.

    And yes, I would suggest always fully documenting your parameters (and their more complex types, if necessary) in the function's docstring. This does not mean you should avoid mentioning them elsewhere as well, but the docstring should explain everything one needs to know to use the function.

    In summary, I would probably write it something like this:

    from typing import TypeAlias
    
    StrInt: TypeAlias = str | int
    ValueType: TypeAlias = StrInt | dict[str, StrInt] | None
    ItemType: TypeAlias = dict[str, ValueType]
    
    
    def get_order_list(
        response: list[ItemType],
        *,
        time_r: int = 60,
        buyers: bool = False,
    ) -> list[int]:
        """
        Does X, Y, and Z and returns a list of ...
        
        [Some additional context/info for when/how this function should be used,
        expanding on the first sentence of the docstring,
        which should be a one-liner.] 
    
        Args:
            response:
                a list of "items"; an item should be a dictionary with string keys,
                where each value is a "value"; a value can be a string, an integer,
                a dictionary mapping strings to strings or integers, or `None`
            time_r:
                time for ...; 60 by default
            buyers:
                Whether or not ...; `False` by default
        
        Returns:
            List of integers, where each one represents ...
        """
        ...
    

    I have omitted a lot of "in general", "typically", and "my advice would be" qualifiers in this answer. But I think it goes without saying that most of this is at least somewhat subjective or conventional.


    PS:

    All that being said, I am a big fan of Pydantic and agree completely with Roland in the comments that this seems like a prime use case for it. Especially if the objects involved mostly follow a certain schema, you could define models for what I called ValueType and ItemType to encapsulate their meaning even better.

    Not to mention it offers validation and provides much better type safety overall.

    But I think most of what I wrote above still applies regardless of whether you use Pydantic models or built-in types.