Parse plain text API response into JSON using Python


An API endpoint that I am using for a project returns a plain text response of the form:

[RESPONSE]
code = 200
description = Command completed successfully
queuetime = 0
runtime = 0.071
property[abuse policy][0] = The policies are published at the REGISTRY_OPERATOR website at:
property[abuse policy][1] = =>https://registry.in/Policies 
property[abuse policy][2] = 
property[abuse policy][3] = IN Policy Framework: https://registry.in/system/files/inpolicy_0.pdf
property[abuse policy][4] = IN Domain Anti-Abuse policy: https://registry.in/Policies/IN_Anti_Abuse_Policy
property[abuse policy url][0] = https://registry.in/Policies/IN_Anti_Abuse_Policy
property[active][0] = 0

I am trying to parse this into a dictionary using Python. Currently, I have the following code:

import re
def text_to_dict(text):
    js = {}
    for s in text.splitlines():
        x = s.split("=", maxsplit=1)
        if len(x) > 1:
            keys = [k for i in re.split(r"\]|\[", x[0]) if (k := i.strip())]
            for i, k in enumerate(keys):
                pd = js
                for j,pk in enumerate(keys[:i]):
                    if keys[j+1:j+2] and not (keys[j+1:j+2][0]).isnumeric():
                        pd = pd[pk]
                if k not in pd:
                    if k.isnumeric():
                        pd[keys[i-1]].append((x[1]).strip())
                    else:
                        pd[k] = (x[1]).strip() if i == len(keys)-1 else [] if keys[i+1:i+2] and (keys[i+1:i+2][0]).isnumeric() else {}
    return js

This code can handle the above example, and it returns:

{
    "code": "200",
    "description": "Command completed successfully",
    "queuetime": "0",
    "runtime": "0.071",
    "property": {
        "abuse policy": [
            "The policies are published at the REGISTRY_OPERATOR website at:",
            "=>https://registry.in/Policies",
            "",
            "IN Policy Framework: https://registry.in/system/files/inpolicy_0.pdf",
            "IN Domain Anti-Abuse policy: https://registry.in/Policies/IN_Anti_Abuse_Policy"
        ],
        "abuse policy url": [
            "https://registry.in/Policies/IN_Anti_Abuse_Policy"
        ],
        "active": [
            "0"
        ]
    }
}

However, it cannot handle the following if I append it to the example above:

...
property[active][1][test] = TEST

or

...
property[active][1][0] = TEST

which should return

{
    ...
    "active": [
            "0",
            {"test": "TEST"}
        ]
}

and

{
    ...
    "active": [
            "0",
            ["TEST"]
        ]
}

respectively.

I feel like there is an easier way of accounting for all possibilities without writing a bunch of nested ifs, but I'm not sure what it is.


Solution

  • Your input data is practically in INI file format, which Python's built-in configparser module can read for you.
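
As a minimal sketch of what configparser does with this kind of input: the bracketed keys are kept as plain option names, and only the first "=" on a line separates key from value. The sample text below is made up for illustration (the `Active` key and the `50% done` value are not from the question). Two defaults are worth knowing about: option names are lower-cased, and a bare `%` in a value triggers an interpolation error on lookup. Both can be switched off as shown, if that matters for your API.

```python
from configparser import ConfigParser

# Hypothetical sample in the same format as the question's response.
sample = """\
[RESPONSE]
code = 200
property[abuse policy][1] = =>https://registry.in/Policies
property[Active][0] = 50% done
"""

# interpolation=None lets values contain a literal '%'.
config = ConfigParser(interpolation=None)
# Option names are lower-cased by default; str keeps them as written.
config.optionxform = str
config.read_string(sample)

section = config["RESPONSE"]
print(list(section))
print(section["property[abuse policy][1]"])
```

The first print lists the three keys unchanged, brackets included; the second shows that the value keeps its leading "=>".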

    If we assume that every part of a key such as 'property[foo][0][test]' is a dict key (no nested lists), it parses into this structure:

    {'property': {'foo': {'0': {'test': 'value'}}}}
    

    which can be done with a loop that keeps creating nested dicts:

    from configparser import ConfigParser
    
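    def parse(text):
        config = ConfigParser()
        config.read_string(text)
    
        root = {}
        for key in config['RESPONSE'].keys():
            curr = root
            # 'property[foo][0]' -> 'property', 'foo', '0'
            for key_part in key.replace(']', '').split('['):
                # create the next nesting level the first time it is seen
                if key_part not in curr:
                    curr[key_part] = {}
                prev = curr
                curr = curr[key_part]
            # the innermost level holds the value, not a dict
            prev[key_part] = config['RESPONSE'][key]
        return root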
    

    Usage:

    from pprint import pprint
    
    text = """
    [RESPONSE]
    code = 200
    description = Command completed successfully
    queuetime = 0
    runtime = 0.071
    property[abuse policy][0] = The policies are published at the REGISTRY_OPERATOR website at:
    property[abuse policy][1] = =>https://registry.in/Policies 
    property[abuse policy][2] = 
    property[abuse policy][3] = IN Policy Framework: https://registry.in/system/files/inpolicy_0.pdf
    property[abuse policy][4] = IN Domain Anti-Abuse policy: https://registry.in/Policies/IN_Anti_Abuse_Policy
    property[abuse policy url][0] = https://registry.in/Policies/IN_Anti_Abuse_Policy
    property[active][0] = 0
    property[foo][0][test] = a
    property[foo][1][test] = b
    property[bar][0][0] = A
    property[bar][1][1] = B
    """
    
    pprint(parse(text))
    

    Result:

    {'code': '200',
     'description': 'Command completed successfully',
     'property': {'abuse policy': {'0': 'The policies are published at the '
                                        'REGISTRY_OPERATOR website at:',
                                   '1': '=>https://registry.in/Policies',
                                   '2': '',
                                   '3': 'IN Policy Framework: '
                                        'https://registry.in/system/files/inpolicy_0.pdf',
                                   '4': 'IN Domain Anti-Abuse policy: '
                                        'https://registry.in/Policies/IN_Anti_Abuse_Policy'},
                  'abuse policy url': {'0': 'https://registry.in/Policies/IN_Anti_Abuse_Policy'},
                  'active': {'0': '0'},
                  'bar': {'0': {'0': 'A'}, '1': {'1': 'B'}},
                  'foo': {'0': {'test': 'a'}, '1': {'test': 'b'}}},
     'queuetime': '0',
     'runtime': '0.071'}
    

    You could also check whether key_part is numeric and convert it to int, so that the resulting structure behaves more like one containing lists, e.g.:

    {'property': {'foo': {0: {'test': 'value'}}}}
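
If you want real lists, as in the question's expected output, one possible extension is sketched below. It converts numeric key parts to int while building the nested dicts, then replaces every dict whose keys are exactly 0..n-1 with a list. The names `parse_nested` and `listify` are made up for this sketch, and it assumes a key is never used both as a plain value and as a parent (e.g. `a[0] = x` together with `a[0][b] = y`).

```python
from configparser import ConfigParser


def listify(node):
    """Recursively replace dicts keyed exactly 0..n-1 with lists."""
    if not isinstance(node, dict):
        return node
    converted = {k: listify(v) for k, v in node.items()}
    if converted and set(converted) == set(range(len(converted))):
        return [converted[i] for i in range(len(converted))]
    return converted


def parse_nested(text, section="RESPONSE"):
    """Parse 'a[b][0] = v' lines into nested dicts and lists."""
    config = ConfigParser(interpolation=None)  # allow literal '%'
    config.optionxform = str                   # keep key case as written
    config.read_string(text)

    root = {}
    for key, value in config[section].items():
        # 'property[foo][0]' -> ['property', 'foo', 0]
        parts = [int(p) if p.isdecimal() else p
                 for p in key.replace("]", "").split("[")]
        curr = root
        for part in parts[:-1]:
            curr = curr.setdefault(part, {})
        curr[parts[-1]] = value
    return listify(root)


example = """\
[RESPONSE]
code = 200
property[active][0] = 0
property[active][1][test] = TEST
"""
print(parse_nested(example))
# {'code': '200', 'property': {'active': ['0', {'test': 'TEST'}]}}
```

Dicts with gaps in their indices (such as the `'1': {'1': 'B'}` entry in the result above) are deliberately left as dicts, since there is no obvious value to put in the missing positions.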