I have the following data, which I receive via an SSH session to a switch. I want to convert the text input to a dict for easy access and to make it possible to monitor certain values.
I cannot extract the data without a ton of splits and regexes, and I still get stuck.
Port : 1
Media Type : SF+_SR
Vendor Name : VENDORX
Part Number : SFP-10G-SR
Serial Number : Gxxxxxxxx
Wavelength: 850 nm
Temp (Celsius) : 37.00 Status : Normal
Low Warn Threshold : -40.00 High Warn Threshold : 85.00
Low Alarm Threshold : -50.00 High Alarm Threshold : 100.00
Voltage AUX-1/Vcc (Volts) : 3.27 Status : Normal
Low Warn Threshold : 3.10 High Warn Threshold : 3.50
Low Alarm Threshold : 3.00 High Alarm Threshold : 3.60
Tx Power (dBm) : -3.11 Status : Normal
Low Warn Threshold : -7.30 High Warn Threshold : 2.00
Low Alarm Threshold : -9.30 High Alarm Threshold : 3.00
Rx Power (dBm) : -4.68 Status : Normal
Low Warn Threshold : -11.10 High Warn Threshold : 2.00
Low Alarm Threshold : -13.10 High Alarm Threshold : 3.00
Tx Bias Current (mA): 6.27 Status : Normal
Low Warn Threshold : 0.00 High Warn Threshold : 12.00
Low Alarm Threshold : 0.00 High Alarm Threshold : 15.00
Port : 2
Media Type : SF+_SR
Vendor Name : VENDORY
Part Number : SFP-10G-SR
Serial Number : Gxxxxxxxx
Wavelength : 850 nm
Temp (Celsius) : 37.00 Status : Normal
..... etc - till port 48
Which I want to convert to:
[
{
"port": "1",
"vendor": "VENDORX",
"media_type": "SF+_SR",
"part_number": "SFP-10G-SR",
"serial_number": "Gxxxxxxxx",
"wavelength": "850 nm",
"temp": {
"value": "37.00",
"status": "normal",
# alarm threshold and warn threshold may be ignored
},
"voltage_aux-1": {
"value": "3.27",
"status": "normal",
# alarm threshold and warn threshold may be ignored
},
"tx_power": {
"value": "-3.11",
"status": "normal",
# alarm threshold and warn threshold may be ignored
},
"rx_power": {
"value": "-4.68",
"status": "normal",
# alarm threshold and warn threshold may be ignored
},
"tx_bias_current": {
"value": "6.27",
"status": "normal",
# alarm threshold and warn threshold may be ignored
},
},
{
"port": "2",
"vendor": "VENDORY",
"media_type": "SF+_SR",
"part_number": "SFP-10G-SR",
"serial_number": "Gxxxxxxxx",
"wavelength": "850 nm",
"temp": {
"value": "37.00",
"status": "normal",
# alarm threshold and warn threshold may be ignored
},
...... etc
}
]
Updated (Complete rewrite and simplification).
Here are some ideas for you -- adjust to taste.
The solution herein tries to avoid using "domain specific knowledge" as much as possible. The only assumptions concern how keys are cleaned up: they are lowercased, anything in parentheses is dropped, and so are a few fragments ('name', 'threshold', and /...).
Ultimately, when a key has multiple values (e.g. 'port'), these values are put together as a list. When a key has a value that is a single dict (like for 'temp'), the first key of that dict (the same as the key itself) is replaced by 'value'. Thus, we will see {'port': [{'port': 1, ...}, {'port': 2, ...}, ...]}, but {'temp': {'value': 37, ...}}.
We start by splitting each line into (key, value) pairs and noting the indentation of the line. The result is a list of records, each containing (indent, [(k0, v0), ...]):
import re

def proc_kv(k, v):
    k = re.sub(r'\(.*\)', '', k.lower())
    k = re.sub(r' (?:name|threshold)', '', k)
    k = re.sub(r'/\S+', '', k)
    k = '_'.join(k.strip().split())
    for typ in (int, float):
        try:
            v = typ(v)
            break
        except ValueError:
            pass
    return k, v

def proc_line(s):
    s = re.sub(r'\t', ' ' * 4, s)  # handle tabs if any
    # split into one or more key-value pairs
    p = [e.strip() for e in re.split(r':', s)]
    if len(p) < 2:
        return None
    # if there are several pairs, use the largest space
    # to split '{v[i]} {k[i+1]}'
    p = [p[0]] + [
        e for x in p[1:-1]
        for e in x.split(max(re.split(r'( +)', x)[1::2]), maxsplit=1)
    ] + [p[-1]]
    kv_pairs = [proc_kv(k, v) for k, v in zip(p[::2], p[1::2])]
    # figure out the indentation of that line
    indent = len(s) - len(s.lstrip(' '))
    return indent, kv_pairs
Example on your text:
records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
>>> records
[(0, [('port', 1)]),
(4, [('media_type', 'SF+_SR')]),
(4, [('vendor', 'VENDORX')]),
(4, [('part_number', 'SFP-10G-SR')]),
(4, [('serial_number', 'Gxxxxxxxx')]),
(4, [('wavelength', '850 nm')]),
(4, [('temp', 37.0), ('status', 'Normal')]),
(10, [('low_warn', -40.0), ('high_warn', 85.0)]),
...
Note that not only keys but also values may contain spaces (e.g. 'Wavelength : 850 nm'). We decided to use the largest space to split the intermediary '{v[i]} {k[i+1]}' substrings. Thus:
>>> proc_line('  a b : 34 nm    c d : 4 ft')
(2, [('a_b', '34 nm'), ('c_d', '4 ft')])
# but
>>> proc_line('  a b : 34    nm c d : 4 ft')
(2, [('a_b', 34), ('nm_c_d', '4 ft')])
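To look at the "largest space" rule on its own, here is a minimal sketch of just that step. The helper name split_on_widest_gap is made up for this illustration and is not part of the code above. It assumes the text between two colons holds one value followed by the next key, and that the widest run of spaces is the boundary between them:

```python
import re

def split_on_widest_gap(text):
    """Split text once, at its longest run of spaces (hypothetical helper)."""
    gaps = re.findall(r' +', text)
    if not gaps:
        # no space at all: treat everything as the value, with no following key
        return text, ''
    # on ties, max() keeps the first of the equally wide gaps
    widest = max(gaps, key=len)
    left, right = text.split(widest, 1)
    return left, right

print(split_on_widest_gap('37.00    Status'))     # value, then next key
print(split_on_widest_gap('34 nm    c d'))        # value with a unit
print(split_on_widest_gap('34    nm c d'))        # unit ends up in the key
```

If the device pads columns inconsistently, this heuristic will misplace units as in the last call, so it is worth checking a few real lines before relying on it.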
We then construct a hierarchical representation of the records in a way that takes indentation into account:
def get_blocks(records, parent=None):
    indent, _ = records[0]
    starts = [i for i, (o_indent, _) in enumerate(records) if o_indent == indent]
    block = [] if parent is None else parent.copy()
    continuation_block = len(block) > 1
    for i, j in zip(starts, starts[1:] + [len(records)]):
        _, kv = records[i]
        continuation_block &= (single_line := i + 1 == j)
        if continuation_block:
            block += kv
        elif single_line:
            block += [(kv[0][0], kv)] if len(kv) > 1 else kv
        else:
            block.append((kv[0][0], get_blocks(records[i+1:j], parent=kv)))
    return block
Example on the records above (obtained from your txt):
blocks = get_blocks(records)
>>> blocks
[('port',
[('port', 1),
('media_type', 'SF+_SR'),
('vendor', 'VENDORX'),
('part_number', 'SFP-10G-SR'),
('serial_number', 'Gxxxxxxxx'),
('wavelength', '850 nm'),
('temp',
[('temp', 37.0),
...
Note the repeated first key in sub-blocks, e.g. ('port', [('port', 1), ...]) and ('temp', [('temp', 37.0), ...]).
We then transform the hierarchical blocks structure into a dict, with some ad-hoc logic (no clobbering of (k, v) pairs that have the same key, etc.), and finally put all the pieces together in a proc_txt() function:
def reshape(a):
    if isinstance(a, list) and len(a) == 1:
        a = a[0]
        if isinstance(a, dict):
            a = {'value' if i == 0 else k: v for i, (k, v) in enumerate(a.items())}
    return a

def to_dict(blocks):
    if not isinstance(blocks, list):
        return blocks
    d = {}
    for k, v in blocks:
        d[k] = d.get(k, []) + [to_dict(v)]
    return {k: reshape(v) for k, v in d.items()}

def proc_txt(txt):
    records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
    blocks = get_blocks(records)
    d = to_dict(blocks)
    return d
>>> proc_txt(txt)
{'port': [{'port': 1,
'media_type': 'SF+_SR',
'vendor': 'VENDORX',
'part_number': 'SFP-10G-SR',
'serial_number': 'Gxxxxxxxx',
'wavelength': '850 nm',
'temp': {'value': 37.0,
'status': 'Normal',
'low_warn': -40.0,
'high_warn': 85.0,
'low_alarm': -50.0,
'high_alarm': 100.0},
...
]}
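Since the goal was to monitor certain values, here is one possible way to walk a result of this shape. It is only a sketch: iter_readings, abnormal_readings and the sample dict are invented for the illustration, and the 'High Warn' status on port 2 is made up so there is something to report. It assumes every monitored block is a dict with 'value' and 'status' keys, as in the output above:

```python
def iter_readings(parsed):
    """Yield (port, sensor, value, status) for each sub-dict that has a status."""
    ports = parsed.get('port', [])
    if isinstance(ports, dict):
        # assumption: with a single port the result holds one dict
        # (whose first key was renamed to 'value') instead of a list
        ports = [ports]
    for port in ports:
        port_id = port.get('port', port.get('value'))
        for name, field in port.items():
            if isinstance(field, dict) and 'status' in field:
                yield port_id, name, field.get('value'), field['status']

def abnormal_readings(parsed, ok_statuses=('normal',)):
    """Return the readings whose status is not in ok_statuses (case-insensitive)."""
    return [r for r in iter_readings(parsed)
            if str(r[3]).lower() not in ok_statuses]

# hand-written sample in the same shape as the proc_txt() output above
sample = {'port': [
    {'port': 1, 'vendor': 'VENDORX',
     'temp': {'value': 37.0, 'status': 'Normal',
              'low_warn': -40.0, 'high_warn': 85.0},
     'tx_power': {'value': -3.11, 'status': 'Normal'}},
    {'port': 2, 'vendor': 'VENDORY',
     # made-up status, for illustration only
     'temp': {'value': 37.0, 'status': 'High Warn'}},
]}

for reading in abnormal_readings(sample):
    print(reading)
```

The thresholds are simply ignored here, since the question says they may be dropped; comparing 'value' against 'low_warn' and 'high_warn' yourself would be an alternative if you do not trust the reported status.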