
Python convert switch data (text) to dict


I have the following data, which I receive via an SSH session to a switch. I want to convert this text input to a dict for easy access and so that I can monitor certain values.

I cannot extract the data without a ton of splits and regexes, and I still get stuck.

Port :  1

    Media Type            : SF+_SR
    Vendor Name           : VENDORX
    Part Number           : SFP-10G-SR
    Serial Number         : Gxxxxxxxx
    Wavelength:             850 nm

    Temp (Celsius)            :  37.00      Status               :  Normal
          Low Warn Threshold  : -40.00      High Warn Threshold  :  85.00
          Low Alarm Threshold : -50.00      High Alarm Threshold :  100.00

    Voltage AUX-1/Vcc (Volts) :  3.27       Status               :  Normal
          Low Warn Threshold  :  3.10       High Warn Threshold  :  3.50
          Low Alarm Threshold :  3.00       High Alarm Threshold :  3.60


    Tx Power (dBm)            : -3.11       Status               :  Normal
          Low Warn Threshold  : -7.30       High Warn Threshold  :  2.00
          Low Alarm Threshold : -9.30       High Alarm Threshold :  3.00

    Rx Power (dBm)            : -4.68       Status               :  Normal
          Low Warn Threshold  : -11.10      High Warn Threshold  :  2.00
          Low Alarm Threshold : -13.10      High Alarm Threshold :  3.00

    Tx Bias Current (mA):        6.27       Status               :  Normal
          Low Warn Threshold  :  0.00       High Warn Threshold  :  12.00
          Low Alarm Threshold :  0.00       High Alarm Threshold :  15.00

Port :  2

    Media Type            : SF+_SR
    Vendor Name           : VENDORY
    Part Number           : SFP-10G-SR
    Serial Number         : Gxxxxxxxx
    Wavelength            : 850 nm

    Temp (Celsius)            :  37.00      Status               :  Normal

..... etc - till port 48

Which I want to convert to:

[
    {
       "port": "1",
       "vendor": "VENDORX",
       "media_type": "SF+_SR",
       "part_number": "SFP-10G-SR",
       "serial_number": "Gxxxxxxxx",
       "wavelength": "850 nm",
       "temp": {
           "value": "37.00",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "voltage_aux-1": {
           "value": "3.27",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "tx_power": {
           "value": "-3.11",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "rx_power": {
           "value": "-4.68",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       "tx_bias_current": {
           "value": "6.27",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
    },
    {
       "port": "2",
       "vendor": "VENDORY",
       "media_type": "SF+_SR",
       "part_number": "SFP-10G-SR",
       "serial_number": "Gxxxxxxxx",
       "wavelength": "850 nm",
       "temp": {
           "value": "37.00",
           "status": "normal",
           # alarm threshold and warn threshold may be ignored
       },
       ...... etc
    }
]

Solution

  • Updated (Complete rewrite and simplification).

    Here are some ideas for you -- adjust to taste.

    The solution herein tries to avoid using "domain specific knowledge" as much as possible. The only assumptions are:

    1. Empty lines don't matter.
    2. Indentation is meaningful.
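    3. Keys are transformed to lowercase, and some content is removed: anything in parentheses, the words 'name' and 'threshold', and anything from a '/' up to the next whitespace. The remaining words are joined with underscores.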
    4. When a line has multiple "key : value" pairs or is followed by an indented group of lines, that is a block of information pertaining to the first key.

    Ultimately, when a key has multiple values (e.g. 'port'), then these values are put together as a list. When a key has a value that is a single dict (like for 'temp'), then the first key of that dict (the same as the key itself) is replaced by 'value'. Thus, we will see:

    • {'port': [{'port': 1, ...}, {'port': 2, ...}, ...]}, but
    • {'temp': {'value': 37, ...}}.
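
To make assumption 3 concrete, here is a minimal standalone sketch of the key clean-up applied to labels from the question. The helper name normalize_key is hypothetical (it is not part of the original answer); its three substitutions are the same ones used in proc_kv() in the next section.

```python
import re

def normalize_key(label):
    """Hypothetical helper: the key clean-up of assumption 3 in isolation."""
    k = re.sub(r'\(.*\)', '', label.lower())    # drop '(Celsius)', '(dBm)', ...
    k = re.sub(r' (?:name|threshold)', '', k)   # drop the words 'name' / 'threshold'
    k = re.sub(r'/\S+', '', k)                  # drop '/Vcc' and similar suffixes
    return '_'.join(k.strip().split())          # join remaining words with '_'

for label in ('Temp (Celsius)', 'Voltage AUX-1/Vcc (Volts)',
              'Low Warn Threshold', 'Vendor Name'):
    print(f'{label!r} -> {normalize_key(label)!r}')
# 'Temp (Celsius)' -> 'temp'
# 'Voltage AUX-1/Vcc (Volts)' -> 'voltage_aux-1'
# 'Low Warn Threshold' -> 'low_warn'
# 'Vendor Name' -> 'vendor'
```

These are the key names requested in the question, which is why no per-field mapping table is needed.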

    Records

    We start by splitting each line into (key, value) pairs and noting the indentation of the line. The result is a list of records, each of the form (indent, [(k0, v0), ...]):

    import re
    
    def proc_kv(k, v):
        k = re.sub(r'\(.*\)', '', k.lower())
        k = re.sub(r' (?:name|threshold)', '', k)
        k = re.sub(r'/\S+', '', k)
        k = '_'.join(k.strip().split())
        for typ in (int, float):
            try:
                v = typ(v)
                break
            except ValueError:
                pass
        return k, v
    
    def proc_line(s):
        s = re.sub(r'\t', ' ' * 4, s)  # handle tabs if any
        # split into one or more key-value pairs
        p = [e.strip() for e in re.split(r':', s)]
        if len(p) < 2:
            return None
        # if there are several pairs, use the largest space
        # to split '{v[i]} {k[i+1]}'
        p = [p[0]] + [
            e for x in p[1:-1]
            for e in x.split(max(re.split(r'( +)', x)[1::2]), maxsplit=1)
        ] + [p[-1]]
        kv_pairs = [proc_kv(k, v) for k, v in zip(p[::2], p[1::2])]
        # figure out the indentation of that line
        indent = len(s) - len(s.lstrip(' '))
        return indent, kv_pairs
    

    Example on your text:

    >>> records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
    >>> records
    [(0, [('port', 1)]),
     (4, [('media_type', 'SF+_SR')]),
     (4, [('vendor', 'VENDORX')]),
     (4, [('part_number', 'SFP-10G-SR')]),
     (4, [('serial_number', 'Gxxxxxxxx')]),
     (4, [('wavelength', '850 nm')]),
     (4, [('temp', 37.0), ('status', 'Normal')]),
     (10, [('low_warn', -40.0), ('high_warn', 85.0)]),
     ...
    

    Note that not only keys but also values may contain spaces (e.g. 'Wavelength : 850 nm'). We decided to use the largest run of spaces to split the intermediary '{v[i]} {k[i+1]}' substrings. Thus:

    >>> proc_line('  a b : 34 nm  c d : 4 ft')
    (2, [('a_b', '34 nm'), ('c_d', '4 ft')])
    
    # but
    >>> proc_line('  a b : 34 nm c d : 4 ft')
    (2, [('a_b', 34), ('nm_c_d', '4 ft')])
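
The "largest space" rule can also be looked at on its own. The sketch below is one possible standalone version, not the answer's exact code: the name split_on_widest_gap is hypothetical, and the fallback for input without any space is an added assumption. It is worth having because the inline expression in proc_line() calls max() on an empty list, and therefore raises ValueError, when a middle piece contains no space at all (for example a value such as 12:30:05).

```python
import re

def split_on_widest_gap(s):
    """Split s once at its longest run of spaces and return (left, right).

    Assumes s is already stripped. If s contains no space, return (s, '')
    rather than raising (an assumption of this sketch).
    """
    gaps = re.findall(r' +', s)
    if not gaps:
        return s, ''
    widest = max(gaps, key=len)          # first longest run wins on ties
    left, right = s.split(widest, 1)     # exactly two parts: widest occurs in s
    return left, right

print(split_on_widest_gap('37.00      Status'))  # ('37.00', 'Status')
print(split_on_widest_gap('34 nm  c d'))         # ('34 nm', 'c d')
print(split_on_widest_gap('34 nm c d'))          # ('34', 'nm c d')
print(split_on_widest_gap('30'))                 # ('30', '')
```

As in the examples above, the heuristic only works when the gap between a value and the next key is wider than any gap inside the value or the key.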
    

    Blocks

    We then construct a hierarchical representation of the records in a way that takes indentation into account:

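    def get_blocks(records, parent=None):
        # note: the := operator below requires Python 3.8+
        indent, _ = records[0]
        # indices of all records at the same indentation as the first one
        starts = [i for i, (o_indent, _) in enumerate(records) if o_indent == indent]
        # a sub-block starts with the (k, v) pairs of its header line
        block = [] if parent is None else parent.copy()
        # a multi-pair header (e.g. temp + status) absorbs the single lines below it
        continuation_block = len(block) > 1
        for i, j in zip(starts, starts[1:] + [len(records)]):
            _, kv = records[i]
            continuation_block &= (single_line := i + 1 == j)
            if continuation_block:
                block += kv
            elif single_line:
                block += [(kv[0][0], kv)] if len(kv) > 1 else kv
            else:
                # line i is followed by nested lines: recurse with its pairs as header
                block.append((kv[0][0], get_blocks(records[i+1:j], parent=kv)))
        return block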
    

    Example on the records above (obtained from your txt):

    >>> blocks = get_blocks(records)
    >>> blocks
    [('port',
      [('port', 1),
       ('media_type', 'SF+_SR'),
       ('vendor', 'VENDORX'),
       ('part_number', 'SFP-10G-SR'),
       ('serial_number', 'Gxxxxxxxx'),
       ('wavelength', '850 nm'),
       ('temp',
        [('temp', 37.0),
         ...
    

    Note the repeated first key in sub-blocks (e.g. ('port', [('port', 1), ...]) and ('temp', [('temp', 37.0), ...])).

    Final structure

    We then transform the hierarchical block structure into a dict, with some ad-hoc logic (e.g. (k, v) pairs that share a key are collected into a list rather than clobbering one another). Finally, we put all the pieces together in a proc_txt() function:

    def reshape(a):
        if isinstance(a, list) and len(a) == 1:
            a = a[0]
        if isinstance(a, dict):
            a = {'value' if i == 0 else k: v for i, (k, v) in enumerate(a.items())}
        return a
    
    def to_dict(blocks):
        if not isinstance(blocks, list):
            return blocks
        d = {}
        for k, v in blocks:
            d[k] = d.get(k, []) + [to_dict(v)]
        return {k: reshape(v) for k, v in d.items()}
    
    def proc_txt(txt):
        records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
        blocks = get_blocks(records)
        d = to_dict(blocks)
        return d
    

    Example on your text:

    >>> proc_txt(txt)
    {'port': [{'port': 1,
       'media_type': 'SF+_SR',
       'vendor': 'VENDORX',
       'part_number': 'SFP-10G-SR',
       'serial_number': 'Gxxxxxxxx',
       'wavelength': '850 nm',
       'temp': {'value': 37.0,
        'status': 'Normal',
        'low_warn': -40.0,
        'high_warn': 85.0,
        'low_alarm': -50.0,
        'high_alarm': 100.0},
        ...
    ]}
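
The result above is close to, but not exactly, the layout asked for in the question (a plain list of ports, thresholds dropped, lowercase status). A possible post-processing step is sketched below. The function name simplify and the THRESHOLDS tuple are hypothetical, and the sample input is an abridged, hand-written copy of the output shown above. The single-port branch assumes that, following the reshape() rule, a lone port comes back as one dict whose first key has been renamed to 'value'.

```python
THRESHOLDS = ('low_warn', 'high_warn', 'low_alarm', 'high_alarm')

def simplify(parsed, drop=THRESHOLDS):
    """Return a list with one dict per port, without threshold entries."""
    ports = parsed['port']
    if isinstance(ports, dict):
        # Single port (assumption): reshape() renamed its first key to 'value'.
        ports = [{('port' if k == 'value' else k): v for k, v in ports.items()}]
    result = []
    for port in ports:
        rec = {}
        for k, v in port.items():
            if isinstance(v, dict):
                # Keep non-threshold entries only; lowercase the status text.
                v = {kk: (vv.lower() if kk == 'status' and isinstance(vv, str) else vv)
                     for kk, vv in v.items() if kk not in drop}
            rec[k] = v
        result.append(rec)
    return result

# Sample input shaped like the proc_txt() output above (abridged).
sample = {'port': [
    {'port': 1, 'vendor': 'VENDORX',
     'temp': {'value': 37.0, 'status': 'Normal', 'low_warn': -40.0,
              'high_warn': 85.0, 'low_alarm': -50.0, 'high_alarm': 100.0}},
    {'port': 2, 'vendor': 'VENDORY',
     'temp': {'value': 37.0, 'status': 'Normal'}},
]}
result = simplify(sample)
print(result[0])
# {'port': 1, 'vendor': 'VENDORX', 'temp': {'value': 37.0, 'status': 'normal'}}
```

Values stay numeric here (37.0 rather than the string "37.00" from the question), which is usually more convenient for monitoring thresholds.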