How to write a parser in Python for a character-based protocol

I'm implementing a client for an existing (old) standard for exchanging information between shops and providers in a specific sector, let's say vegetables.

It must be in Python. I want my package to read a plaintext file and build objects that a third-party application can access. I plan to release this implementation of the standard as an open-source library/package and use it in my own project.

A document looks roughly like this (without the # comments):

I1234X9876DELIVERY # id line. 1234 is sender id and 9876 target id.
                   # Doctype "delivery"
H27082022RKG       # header line, specific to the "delivery" doctype.
                   # Delivery on 27 Aug 2022, on the Regular schedule. Units: kg.
PAPPL0010          # Product Apple. 10 kg
PANAN0015          # Product Ananas. 15 kg
PORAN0015          # Product Orange. 15 kg

The standard has 3 types of lines: identifier, header, and detail (body) lines. The header format depends on the document type given in the identifier line, and so does the format of the body lines.

Formats are defined by character length. Each line starts with one character from {I, H, P, ...} that identifies the type of line, for example P. Then, for a product line of a delivery, 4 characters identify the type of product (APPL), followed by a 4-digit number that gives the amount of product (0010, i.e. 10).

I thought about using a hierarchy of classes, maybe with enums, to identify which kind of document I obtained, so that an application can process a delivery document differently from a catalogue document. Then, for a delivery, since the structure is known, the application could read the date attribute and the array of products.

However, I'm not sure of:

how to parse the lines efficiently.
what to build with the parsed message.

What does it sound like to you? I didn't study computer science theory, and although I've been coding for years, this is outside what I usually do. I've read an article about parsing tools for Python, but I'm unsure about the concepts and about which tool to use, if any.

Do I need some grammar parser for this?
What would be a pythonic way to represent the data?

Thank you very much!

PS: the documents use 8-bit character encodings, usually Latin-1, so I can read byte by byte.

Solution

Looking at the first character of each line tells you what kind of line it is, so the line can be passed to a function that processes that kind of information.

Having one function per line format makes testing and maintenance easier.
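
One way to sketch that dispatch is a dictionary that maps the leading character to a handler function. The handler names and the tuples they return are hypothetical, chosen only to show the routing; real handlers would build the document objects.

```python
# Hypothetical handlers: each receives the line without its leading type character.
def handle_identifier(body):
    return ("identifier", body)


def handle_header(body):
    return ("header", body)


def handle_product(body):
    return ("product", body)


# Map the first character of a line to the function that handles it.
HANDLERS = {
    "I": handle_identifier,
    "H": handle_header,
    "P": handle_product,
}


def dispatch(line):
    """Route one line to its handler, rejecting empty lines and unknown types."""
    if not line:
        raise ValueError("empty line")
    try:
        handler = HANDLERS[line[0]]
    except KeyError:
        raise ValueError(f"unknown line type: {line[0]!r}") from None
    return handler(line[1:])
```

Supporting a new line type then means writing one more function and adding one entry to the dictionary.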

The data could be stored in a Python dataclass. Enums also look like a good fit, since the format appears to specify fixed sets of codes.

Using enums to give more meaningful names to the abbreviations used in the format is probably a good idea.
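
As a small illustration, calling an enum with a value read from the file returns the matching member, and an unknown code raises ValueError, which is a natural place to report bad input. The product codes are the ones from the question; lookup_product is a hypothetical helper name.

```python
from enum import Enum


class Product(Enum):
    APPLE = "APPL"
    PINEAPPLE = "ANAN"
    ORANGE = "ORAN"


def lookup_product(code):
    """Return the Product for a 4-character code, or None if the code is unknown."""
    try:
        return Product(code)  # Enum lookup by value
    except ValueError:
        return None
```

Whether an unknown code should become None, a placeholder member, or an error is a design choice that depends on how strict the library needs to be.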

Here is an example of how to do this:

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
import re
from typing import List, Union

data = """
I1234X9876DELIVERY
H27082022RKG
PAPPL0010
PANAN0015
PORAN0015
"""


class Product(Enum):
    APPLE = "APPL"
    PINEAPPLE = "ANAN"
    ORANGE = "ORAN"


class DocType(Enum):
    UNDEFINED = "NONE"
    DELIVERY = "DELIVERY"


class DeliveryType(Enum):
    UNDEFINED = "NONE"
    REGULAR = "R"


class Units(Enum):
    UNDEFINED = "NONE"
    KILOGRAMS = "KG"


@dataclass
class LineItem:
    product: Product
    quantity: int


@dataclass
class Header:
    sender: int = 0
    target: int = 0
    doc_type: DocType = DocType.UNDEFINED


@dataclass
class DeliveryNote(Header):
    delivery_freq: DeliveryType = DeliveryType.UNDEFINED
    date: Union[datetime, None] = None
    units: Units = Units.UNDEFINED
    line_items: List[LineItem] = field(default_factory=list)

    def show(self):
        print(f"Sender: {self.sender}")
        print(f"Target: {self.target}")
        print(f"Type: {self.doc_type.name}")
        print(f"Delivery Date: {self.date.strftime('%d-%b-%Y') if self.date else 'unknown'}")
        print(f"Deliver Type: {self.delivery_freq.name}")
        print(f"Units: {self.units.name}")
        print()
        print(f"\t|{'Item':^12}|{'Qty':^6}|")
        print(f"\t|{'-' * 12}|{'-' * 6}|")
        for entry in self.line_items:
            print(f"\t|{entry.product.name:<12}|{entry.quantity:>6}|")


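def process_identifier(entry):
    # Expected layout: <sender digits>X<target digits><document type>
    match = re.match(r'(\d+)X(\d+)(\w+)', entry)
    if match is None:
        raise ValueError(f"Malformed identifier line: {entry!r}")
    sender, target, doc_type = match.groups()
    doc_type = DocType(doc_type)
    if doc_type == DocType.DELIVERY:
        return DeliveryNote(int(sender), int(target), doc_type)
    # Other document types are not handled in this example.
    raise NotImplementedError(f"Unsupported document type: {doc_type.name}")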


def process_header(entry, doc):
    match = re.match(r'(\d{8})(\w)(\w+)', entry)
    if match:
        date_str, freq, units = match.groups()
        doc.date = datetime.strptime(date_str, '%d%m%Y')
        doc.delivery_freq = DeliveryType(freq)
        doc.units = Units(units)


def process_details(entry, doc):
    match = re.match(r'(\D+)(\d+)', entry)
    if match:
        prod, qty = match.groups()
        doc.line_items.append(LineItem(Product(prod), int(qty)))


def parse_data(file_content):
    doc = None
    for line in file_content.splitlines():
        if line.startswith('I'):
            doc = process_identifier(line[1:])
        elif line.startswith('H'):
            process_header(line[1:], doc)
        elif line.startswith('P'):
            process_details(line[1:], doc)
    return doc


if __name__ == '__main__':
    this_doc = parse_data(data)
    this_doc.show()

When I ran this test it gave the following output:

$ python3 read_protocol.py 
Sender: 1234
Target: 9876
Type: DELIVERY
Delivery Date: 27-Aug-2022
Deliver Type: REGULAR
Units: KILOGRAMS

        |    Item    | Qty  |
        |------------|------|
        |APPLE       |    10|
        |PINEAPPLE   |    15|
        |ORANGE      |    15|

Hopefully that gives you some ideas, although I'm sure I've got many of my assumptions about your data wrong.
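
Because the question says the fields are defined by character length, slicing by position is an alternative to regular expressions. Here is a minimal sketch for product lines. It assumes the layout described in the question (1 type character, 4 product characters, 4 quantity digits) and that any trailing whitespace has already been stripped. ProductLine and parse_product_line are names made up for this illustration.

```python
from typing import NamedTuple


class ProductLine(NamedTuple):
    code: str
    quantity: int


def parse_product_line(line):
    """Slice a product line such as 'PAPPL0010' by fixed character positions."""
    # Assumed layout: 1 type char + 4 product chars + 4 digits = 9 characters.
    if len(line) != 9 or line[0] != "P":
        raise ValueError(f"not a product line: {line!r}")
    code = line[1:5]
    digits = line[5:9]
    # Accept plain ASCII digits only; Latin-1 also contains characters such
    # as superscript digits that str.isdigit() alone would let through.
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"quantity is not numeric: {digits!r}")
    return ProductLine(code, int(digits))
```

Slicing keeps the field widths explicit, which can make it easier to compare the code against the written standard.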

To keep the example short, I haven't shown reading from a file. Python's pathlib.Path.read_text() should make it relatively straightforward to get the data from a file.
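
A minimal sketch of that step, assuming the Latin-1 encoding mentioned in the question. read_document is a hypothetical helper name, and the path is whatever file you want to parse.

```python
from pathlib import Path


def read_document(path, encoding="latin-1"):
    """Read the whole file as text and return its non-empty lines."""
    text = Path(path).read_text(encoding=encoding)
    return [line for line in text.splitlines() if line.strip()]
```

Passing the encoding explicitly avoids depending on the platform default, which matters for 8-bit files like these.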