Python convert switch data (text) to dict

I have the following data, which I receive via an SSH session to a switch. I want to convert this text input to a dict for easy access and to make it possible to monitor certain values.

I cannot extract the data without a ton of splits and regexes, and I still get stuck.

Port :  1

    Media Type            : SF+_SR
    Vendor Name           : VENDORX
    Part Number           : SFP-10G-SR
    Serial Number         : Gxxxxxxxx
    Wavelength:             850 nm

    Temp (Celsius)            :  37.00      Status               :  Normal
          Low Warn Threshold  : -40.00      High Warn Threshold  :  85.00
          Low Alarm Threshold : -50.00      High Alarm Threshold :  100.00

    Voltage AUX-1/Vcc (Volts) :  3.27       Status               :  Normal
          Low Warn Threshold  :  3.10       High Warn Threshold  :  3.50
          Low Alarm Threshold :  3.00       High Alarm Threshold :  3.60


    Tx Power (dBm)            : -3.11       Status               :  Normal
          Low Warn Threshold  : -7.30       High Warn Threshold  :  2.00
          Low Alarm Threshold : -9.30       High Alarm Threshold :  3.00

    Rx Power (dBm)            : -4.68       Status               :  Normal
          Low Warn Threshold  : -11.10      High Warn Threshold  :  2.00
          Low Alarm Threshold : -13.10      High Alarm Threshold :  3.00

    Tx Bias Current (mA):        6.27       Status               :  Normal
          Low Warn Threshold  :  0.00       High Warn Threshold  :  12.00
          Low Alarm Threshold :  0.00       High Alarm Threshold :  15.00

Port :  2

    Media Type            : SF+_SR
    Vendor Name           : VENDORY
    Part Number           : SFP-10G-SR
    Serial Number         : Gxxxxxxxx
    Wavelength            : 850 nm

    Temp (Celsius)            :  37.00      Status               :  Normal

..... etc - till port 48

Which I want to convert to:

[
    {
       "port": "1",
       "vendor": "VENDORX",
       "media_type": "SF+_SR",
       "part_number": "SFP-10G-SR",
       "serial_number": "Gxxxxxxxx",
       "wavelength": "850 nm",
       "temp": {
           "value": "37.00",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "voltage_aux-1": {
           "value": "3.27",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "tx_power": {
           "value": "-3.11",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "rx_power": {
           "value": "-4.68",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "tx_bias_current": {
           "value": "6.27",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
    },
    {
       "port": "2",
       "vendor": "VENDORY",
       "media_type": "SF+_SR",
       "part_number": "SFP-10G-SR",
       "serial_number": "Gxxxxxxxx",
       "wavelength": "850 nm",
       "temp": {
           "value": "37.00",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       ...... etc
    }
]

Solution

Updated (Complete rewrite and simplification).

Here are some ideas for you -- adjust to taste.

The solution herein tries to avoid using "domain specific knowledge" as much as possible. The only assumptions are:

Empty lines don't matter.
Indentation is meaningful.
Keys are transformed to lowercase, and some content is removed: anything in parentheses, the words 'name' and 'threshold', and anything from a slash up to the next space (e.g. '/Vcc').
When a line has multiple "key : value" pairs or is followed by an indented group of lines, that is a block of information pertaining to the first key.

Ultimately, when a key has multiple values (e.g. 'port'), then these values are put together as a list. When a key has a value that is a single dict (like for 'temp'), then the first key of that dict (the same as the key itself) is replaced by 'value'. Thus, we will see:

{'port': [{'port': 1, ...}, {'port': 2, ...}, ...]}, but
{'temp': {'value': 37, ...}}.

Records

We start by splitting each line into (key, value) pairs and note the indentation of the line. The result is a list of records, each containing: (indent, [(k0, v0), ...]):

import re

def proc_kv(k, v):
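    # normalize the key: lowercase it and drop anything in parentheses (units)
    k = re.sub(r'\(.*\)', '', k.lower())
    # drop the filler words ' name' and ' threshold'
    k = re.sub(r' (?:name|threshold)', '', k)
    # drop anything from a slash up to the next space (e.g. '/Vcc')
    k = re.sub(r'/\S+', '', k)
    k = '_'.join(k.strip().split())
    # convert the value to int or float when possible, else keep the string
    for typ in (int, float):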
        try:
            v = typ(v)
            break
        except ValueError:
            pass
    return k, v

def proc_line(s):
    s = re.sub(r'\t', ' ' * 4, s)  # handle tabs if any
    # split into one or more key-value pairs
    p = [e.strip() for e in re.split(r':', s)]
    if len(p) < 2:
        return None
    # if there are several pairs, use the largest space
    # to split '{v[i]} {k[i+1]}'
    p = [p[0]] + [
        e for x in p[1:-1]
        for e in x.split(max(re.split(r'( +)', x)[1::2]), maxsplit=1)
    ] + [p[-1]]
    kv_pairs = [proc_kv(k, v) for k, v in zip(p[::2], p[1::2])]
    # figure out the indentation of that line
    indent = len(s) - len(s.lstrip(' '))
    return indent, kv_pairs

Example on your text:

records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
>>> records
[(0, [('port', 1)]),
 (4, [('media_type', 'SF+_SR')]),
 (4, [('vendor', 'VENDORX')]),
 (4, [('part_number', 'SFP-10G-SR')]),
 (4, [('serial_number', 'Gxxxxxxxx')]),
 (4, [('wavelength', '850 nm')]),
 (4, [('temp', 37.0), ('status', 'Normal')]),
 (10, [('low_warn', -40.0), ('high_warn', 85.0)]),
 ...

Note that not only keys but also values may contain spaces (e.g. 'Wavelength : 850 nm'). We decided to split each intermediate '{v[i]} {k[i+1]}' substring at its longest run of spaces. Thus:

>>> proc_line('  a b : 34 nm  c d : 4 ft')
(2, [('a_b', '34 nm'), ('c_d', '4 ft')])

# but
>>> proc_line('  a b : 34 nm c d : 4 ft')
(2, [('a_b', 34), ('nm_c_d', '4 ft')])
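
The "longest run of spaces" rule can be looked at in isolation with the minimal sketch below. The helper name split_value_and_key is hypothetical, and the guard for a chunk that contains no space at all is an addition for illustration; proc_line() above does not have it.

```python
import re

def split_value_and_key(chunk):
    """Split 'value   next key' at its longest run of spaces (hypothetical helper)."""
    # With a capturing group, re.split keeps the separators at the odd indices.
    gaps = re.split(r'( +)', chunk)[1::2]
    if not gaps:
        # Assumption: a chunk without any space is returned unsplit.
        return [chunk]
    # Strings made only of spaces compare by length, so max() is the longest run.
    # If several runs tie, str.split cuts at the first one.
    return chunk.split(max(gaps), maxsplit=1)

print(split_value_and_key('34 nm  c d'))   # ['34 nm', 'c d']
print(split_value_and_key('34 nm c d'))    # ['34', 'nm c d']
```

The second call shows the limit of the heuristic: when all gaps have the same width, the cut lands on the first one.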

Blocks

We then construct a hierarchical representation of the records in a way that takes indentation into account (the := operator used below requires Python 3.8 or later):

def get_blocks(records, parent=None):
    indent, _ = records[0]
    starts = [i for i, (o_indent, _) in enumerate(records) if o_indent == indent]
    block = [] if parent is None else parent.copy()
    continuation_block = len(block) > 1
    for i, j in zip(starts, starts[1:] + [len(records)]):
        _, kv = records[i]
        continuation_block &= (single_line := i + 1 == j)
        if continuation_block:
            block += kv
        elif single_line:
            block += [(kv[0][0], kv)] if len(kv) > 1 else kv
        else:
            block.append((kv[0][0], get_blocks(records[i+1:j], parent=kv)))
    return block

Example on the records above (obtained from your txt):

blocks = get_blocks(records)
>>> blocks
[('port',
  [('port', 1),
   ('media_type', 'SF+_SR'),
   ('vendor', 'VENDORX'),
   ('part_number', 'SFP-10G-SR'),
   ('serial_number', 'Gxxxxxxxx'),
   ('wavelength', '850 nm'),
   ('temp',
    [('temp', 37.0),
     ...

Note the repeated first key in sub-blocks, e.g. ('port', [('port', 1), ...]) and ('temp', [('temp', 37.0), ...]).

Final structure

We then transform the hierarchical block structure into a dict, with some ad-hoc logic (e.g. (k, v) pairs that share a key are collected into a list instead of overwriting each other). Finally, we put all the pieces together in a proc_txt() function:

def reshape(a):
    if isinstance(a, list) and len(a) == 1:
        a = a[0]
    if isinstance(a, dict):
        a = {'value' if i == 0 else k: v for i, (k, v) in enumerate(a.items())}
    return a

def to_dict(blocks):
    if not isinstance(blocks, list):
        return blocks
    d = {}
    for k, v in blocks:
        d[k] = d.get(k, []) + [to_dict(v)]
    return {k: reshape(v) for k, v in d.items()}

def proc_txt(txt):
    records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
    blocks = get_blocks(records)
    d = to_dict(blocks)
    return d

Example on your text:

>>> proc_txt(txt)
{'port': [{'port': 1,
   'media_type': 'SF+_SR',
   'vendor': 'VENDORX',
   'part_number': 'SFP-10G-SR',
   'serial_number': 'Gxxxxxxxx',
   'wavelength': '850 nm',
   'temp': {'value': 37.0,
    'status': 'Normal',
    'low_warn': -40.0,
    'high_warn': 85.0,
    'low_alarm': -50.0,
    'high_alarm': 100.0},
    ...
]}
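
If you want something closer to the list in your question, a small post-processing step might look like the sketch below. It assumes the multi-port shape shown above. The names to_requested_format and abnormal_readings are hypothetical, the sample data is hand-written (including a made-up 'High Warn' status), and values stay numeric rather than being turned back into strings.

```python
def to_requested_format(parsed):
    """Reshape proc_txt()-style output into a list of port dicts (hypothetical helper)."""
    ports = []
    for port in parsed['port']:
        out = {}
        for key, val in port.items():
            if isinstance(val, dict):
                # Sensor block: keep the reading and its status, drop the thresholds.
                out[key] = {'value': val['value'],
                            'status': str(val.get('status', '')).lower()}
            else:
                out[key] = val
        ports.append(out)
    return ports


def abnormal_readings(ports):
    """Return (port, sensor, status) for every sensor whose status is not 'normal'."""
    return [(p['port'], key, val['status'])
            for p in ports
            for key, val in p.items()
            if isinstance(val, dict) and val['status'] != 'normal']


# Hand-written sample in the shape shown above (abridged, partly made up).
parsed = {'port': [
    {'port': 1, 'vendor': 'VENDORX',
     'temp': {'value': 37.0, 'status': 'Normal',
              'low_warn': -40.0, 'high_warn': 85.0}},
    {'port': 2, 'vendor': 'VENDORY',
     'temp': {'value': 91.0, 'status': 'High Warn'}},  # made-up abnormal reading
]}

ports = to_requested_format(parsed)
print(ports[0]['temp'])          # {'value': 37.0, 'status': 'normal'}
print(abnormal_readings(ports))  # [(2, 'temp', 'high warn')]
```

Keeping the filter separate from the reshaping means the monitoring check can run on every poll without re-parsing the text.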