
How to detect and indent json substrings inside longer non-json text?


I have an existing Python application, which logs like:

import logging
import json
logger = logging.getLogger()

some_var = 'abc'
data = {
   1: 2,
   'blah': {
      'hello': [1, 2, 3]
   }
}

logger.info(f"The value of some_var is {some_var} and data is {json.dumps(data)}")

So the logger.info function is given:

The value of some_var is abc and data is {"1": 2, "blah": {"hello": [1, 2, 3]}}

Currently my logs go to AWS CloudWatch, which does some magic and renders this with indentation like:

The value of some_var is abc and data is {
   "1": 2,
   "blah": {
      "hello": [1, 2, 3]
   }
}

This makes the logs super clear to read.

Now I want to make some changes to my logging, handling it myself with another Python script that wraps around my code and emails out logs when there's a failure.

What I want is some way of taking each log entry (or a stream/list of entries), and applying this indentation.

So I want a function which takes in a string, detects which subset(s) of that string are JSON, then inserts newlines and indentation to pretty-print that JSON.

example input:

Hello, {"a": {"b": "c"}} is some json data, but also {"c": [1,2,3]} is too

example output

Hello, 
{
  "a": {
    "b": "c"
  }
} 
is some json data, but also 
{
  "c": [
    1,
    2,
    3
  ]
}
is too

I have considered splitting up each entry into everything before and after the first {. Leave the left half as is, and pass the right half to json.dumps(json.loads(x), indent=4).

But what if there's stuff after the json object in the log file? Ok, we can just select everything after the first { and before the last }. Then pass the middle bit to the JSON library.

But what if there's two JSON objects in this log entry? (Like in the above example.) We'll have to use a stack to figure out whether any { appears after all prior { have been closed with a corresponding }.

But what if there's something like {"a": "}"}, where a brace appears inside a string (and strings can also contain escapes like \")? Hmm, ok, we need to handle quoting and escaping too. Now I find myself having to write a whole JSON parser from scratch.

Is there any easy way to do this?

I suppose I could use a regex to replace every instance of json.dumps(x) in my whole repo with json.dumps(x, indent=4). But json.dumps is sometimes used outside logging statements, and it just makes all my logging lines that extra bit longer. Is there a neat elegant solution?

(Bonus points if it can parse and indent the json-like output that str(x) produces in python. That's basically json with single quotes instead of double.)


Solution

  • In order to extract JSON objects from a string, see this answer. The extract_json_objects() function from that answer handles JSON objects, including nested ones, but nothing else: if a bare list appears in your log outside of a JSON object, it's not going to be picked up.

    In your case, modify the function to also return the strings/text around all the JSON objects, so that you can put them all into the log together (or replace the logline):

    from json import JSONDecoder
    
    def extract_json_objects(text, decoder=JSONDecoder()):
        pos = 0
        while True:
            match = text.find('{', pos)
            if match == -1:
                yield text[pos:]  # return the remaining text
                break
            yield text[pos:match]  # modification for the non-JSON parts
            try:
                result, index = decoder.raw_decode(text[match:])
                yield result
                pos = match + index
            except ValueError:
                pos = match + 1
    

    Use that function to process your loglines, add them to a list of strings, which you then join together to produce a single string for your output, logger, etc.:

    def jsonify_logline(line):
        line_parts = []
        for result in extract_json_objects(line):
            if isinstance(result, dict):  # got a JSON obj
                line_parts.append(json.dumps(result, indent=4))
            else:                         # got text/non-JSON-obj
                line_parts.append(result)
        # (don't make that a list comprehension, quite un-readable)
    
        return ''.join(line_parts)
    

    Example:

    >>> demo_text = """Hello, {"a": {"b": "c"}} is some json data, but also {"c": [1,2,3]} is too"""
    >>> print(jsonify_logline(demo_text))
    Hello, {
        "a": {
            "b": "c"
        }
    } is some json data, but also {
        "c": [
            1,
            2,
            3
        ]
    } is too
    >>>
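For the bonus point about the single-quoted, JSON-like output that str(x) produces, one possible sketch is to fall back on ast.literal_eval from the standard library, which parses Python literals. Unlike JSONDecoder.raw_decode, literal_eval doesn't report where a literal ends, so this sketch probes each closing brace; the pretty_print_python_literals name is hypothetical, and it only handles dicts whose contents are JSON-serializable:

```python
import ast
import json


def pretty_print_python_literals(text):
    """Find '{'-delimited Python dict literals (as produced by str(x))
    and re-dump them as indented JSON. Non-dict text passes through."""
    parts = []
    pos = 0
    while True:
        start = text.find('{', pos)
        if start == -1:
            parts.append(text[pos:])  # remaining non-literal text
            break
        parts.append(text[pos:start])
        # literal_eval has no raw_decode equivalent that tells us where
        # the literal ends, so try progressively longer candidates that
        # end at each '}'.  O(n^2) worst case -- fine for log lines.
        end = -1
        for i, ch in enumerate(text[start:], start):
            if ch != '}':
                continue
            try:
                obj = ast.literal_eval(text[start:i + 1])
                parts.append(json.dumps(obj, indent=4))
                end = i + 1
                break
            except (ValueError, SyntaxError, TypeError):
                continue  # not a complete/serializable literal yet
        if end == -1:
            parts.append(text[start])  # lone '{', keep it as-is
            pos = start + 1
        else:
            pos = end
    return ''.join(parts)
```

This is deliberately naive (it will also match real JSON, since double-quoted dicts are valid Python literals too), but it covers the single-quote case without writing a parser from scratch.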
    

    Other things not directly related which would have helped:

    • Instead of using json.dumps(x) in all your log lines, follow the DRY principle and create a function like logdump(x) which does whatever you'd want to do, like json.dumps(x), or json.dumps(x, indent=4), or jsonify_logline(x). That way, if you need to change the JSON format for all your logs, you change just that one function; no need for a mass "search & replace", which comes with its own issues and edge cases.
      • You can even add an optional parameter like pretty=True to decide whether you want it indented or not.
    • You could mass search & replace all your existing loglines to do logger.blah(jsonify_logline(<previous log f-string or text>))
    • If you are JSON-dumping custom objects/class instances, have their __str__ method always output pretty-printed JSON, and keep __repr__ non-pretty/compact.
      • Then you wouldn't need to modify the logline at all: logger.info(f'here is my object {x}') directly invokes x.__str__.
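The first and last suggestions above can be sketched together as follows; logdump and the Order class are hypothetical names, shown only to illustrate the pattern:

```python
import json


def logdump(x, pretty=True):
    """Single point of control for how data appears in logs.
    Flip `pretty` (or edit this one function) to change every log line."""
    if pretty:
        return json.dumps(x, indent=4)
    return json.dumps(x)


class Order:
    """Example custom class: __str__ pretty-prints for logs,
    __repr__ stays compact for debugging/containers."""

    def __init__(self, items):
        self.items = items

    def __str__(self):
        return json.dumps({"items": self.items}, indent=4)

    def __repr__(self):
        return json.dumps({"items": self.items})
```

With this in place, f-strings do the right thing automatically, because f"{x}" calls str(x): logger.info(f"here is my object {Order([1, 2])}") logs the indented form, while printing a list of Orders uses the compact repr.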