How to write a parser in Python for a character-based protocol


I'm implementing a client for an already existing (old) standard for exchanging information between shops and providers of some specific sector, let's say vegetables.

It must be in Python, and I want my package to read a plaintext file and build objects that a third-party application can access. I want to write a client, an implementation of this standard in Python, offer it open source as a library/package, and use it in my own project.

It looks roughly like this (without the # comments)

I1234X9876DELIVERY # id line. 1234 is sender id and 9876 target id.
                   # Doctype "delivery"
H27082022RKG       # header line. Specific to the "delivery" doctype.
                   # It will happen at 27 aug '22, at Regular time schedule. Units kg.
PAPPL0010          # Product Apple. 10 kg
PANAN0015          # Product Ananas. 15 kg
PORAN0015          # Product Orange. 15 kg

The standard has 3 types of lines: identifier, header, and details (or body). The header format depends on the document type given in the identifier line. The body lines also depend on the doc type.

Formats are defined by character length. The line starts with one character from {I, H, P, ...} that identifies the type of line, such as P. Then, if it's a product of a delivery, 4 characters identify the type of product (APPL), followed by a 4-digit number that specifies the amount of product (0010, i.e. 10).

I thought about using a hierarchy of classes, maybe enums, to identify which kind of document I obtained, so that an application can process a delivery document differently from a catalogue document. Then, for a delivery, since the structure is known, it can read the date attribute and the products array.

However, I'm not sure of:

  1. how to parse the lines efficiently.
  2. what to build from the parsed message.

How does this sound to you? I didn't study computer science theory, and although I've been coding for years, this is outside what I usually do. I've read an article about parsing tools for Python, but I'm unsure of the concepts and which tool to use, if any.

  1. Do I need some grammar parser for this?
  2. What would be a pythonic way to represent the data?

Thank you very much!

PS: the documents use 8-bit character encodings, usually Latin-1, so I can read byte by byte.


Solution

  • The first character of each line identifies its type, so each line can be dispatched to a function that processes that type of line.

    Having one function per line format makes testing and maintenance easier.

    The data could be stored in a Python dataclass. Enums also look like a good fit, since the format appears to define a fixed set of codes.

    Using enums to give more meaningful names to the abbreviations used in the format is probably a good idea.

    Here is an example of how to do this:

    from dataclasses import dataclass, field
    from datetime import datetime
    from enum import Enum
    import re
    from typing import List, Union
    
    data = """
    I1234X9876DELIVERY
    H27082022RKG
    PAPPL0010
    PANAN0015
    PORAN0015
    """
    
    
    class Product(Enum):
        APPLE = "APPL"
        PINEAPPLE = "ANAN"
        ORANGE = "ORAN"
    
    
    class DocType(Enum):
        UNDEFINED = "NONE"
        DELIVERY = "DELIVERY"
    
    
    class DeliveryType(Enum):
        UNDEFINED = "NONE"
        REGULAR = "R"
    
    
    class Units(Enum):
        UNDEFINED = "NONE"
        KILOGRAMS = "KG"
    
    
    @dataclass
    class LineItem:
        product: Product
        quantity: int
    
    
    @dataclass
    class Header:
        sender: int = 0
        target: int = 0
        doc_type: DocType = DocType.UNDEFINED
    
    
    @dataclass
    class DeliveryNote(Header):
        delivery_freq: DeliveryType = DeliveryType.UNDEFINED
        date: Union[datetime, None] = None
        units: Units = Units.UNDEFINED
        line_items: List[LineItem] = field(default_factory=list)
    
        def show(self):
            print(f"Sender: {self.sender}")
            print(f"Target: {self.target}")
            print(f"Type: {self.doc_type.name}")
            print(f"Delivery Date: {self.date.strftime('%d-%b-%Y')}")
            print(f"Deliver Type: {self.delivery_freq.name}")
            print(f"Units: {self.units.name}")
            print()
            print(f"\t|{'Item':^12}|{'Qty':^6}|")
            print(f"\t|{'-' * 12}|{'-' * 6}|")
            for entry in self.line_items:
                print(f"\t|{entry.product.name:<12}|{entry.quantity:>6}|")
    
    
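    def process_identifier(entry):
        # Expected layout: <sender id>X<target id><doc type>
        match = re.match(r'(\d+)X(\d+)(\w+)', entry)
        if match is None:
            raise ValueError(f"Malformed identifier line: {entry!r}")
        sender, target, doc_type = match.groups()
        doc_type = DocType(doc_type)
        if doc_type == DocType.DELIVERY:
            return DeliveryNote(int(sender), int(target), doc_type)
        # Only delivery documents are handled in this example
        raise ValueError(f"Unsupported document type: {doc_type.name}")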
    
    
    def process_header(entry, doc):
        match = re.match(r'(\d{8})(\w)(\w+)', entry)
        if match:
            date_str, freq, units = match.groups()
            doc.date = datetime.strptime(date_str, '%d%m%Y')
            doc.delivery_freq = DeliveryType(freq)
            doc.units = Units(units)
    
    
    def process_details(entry, doc):
        match = re.match(r'(\D+)(\d+)', entry)
        if match:
            prod, qty = match.groups()
            doc.line_items.append(LineItem(Product(prod), int(qty)))
    
    
    def parse_data(file_content):
        doc = None
        for line in file_content.splitlines():
            if line.startswith('I'):
                doc = process_identifier(line[1:])
            elif line.startswith('H'):
                process_header(line[1:], doc)
            elif line.startswith('P'):
                process_details(line[1:], doc)
        return doc
    
    
    if __name__ == '__main__':
        this_doc = parse_data(data)
        this_doc.show()
    

    When I ran this test it gave the following output:

    $ python3 read_protocol.py 
    Sender: 1234
    Target: 9876
    Type: DELIVERY
    Delivery Date: 27-Aug-2022
    Deliver Type: REGULAR
    Units: KILOGRAMS
    
            |    Item    | Qty  |
            |------------|------|
            |APPLE       |    10|
            |PINEAPPLE   |    15|
            |ORANGE      |    15|
    
    
    

    Hopefully that gives you some ideas. I'm sure I have got several assumptions about your data wrong.
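
Since the question says the fields are defined by character length, plain string slicing may be a simpler alternative to the regular expressions used above. The following is a minimal sketch, assuming the product line layout seen in the sample (1 type character, a 4-character product code, a 4-digit quantity). The names PRODUCT_LINE and parse_fixed_width are made up for illustration, and the offsets would need checking against the real standard.

```python
# Hypothetical field layout for a delivery product line, e.g. "PAPPL0010".
# Each entry is (field name, start offset, end offset). The offsets are
# assumptions taken from the sample in the question, not from the standard.
PRODUCT_LINE = (
    ("line_type", 0, 1),
    ("product", 1, 5),
    ("quantity", 5, 9),
)


def parse_fixed_width(line, layout):
    """Cut a line into named string fields using a (name, start, end) layout."""
    expected_length = layout[-1][2]
    if len(line) < expected_length:
        raise ValueError(f"Line too short for layout: {line!r}")
    return {name: line[start:end] for name, start, end in layout}


fields = parse_fixed_width("PAPPL0010", PRODUCT_LINE)
quantity = int(fields["quantity"])  # "0010" becomes 10
print(fields["product"], quantity)
```

One layout table per line type and doc type would keep the format description in data rather than in code, which might make it easier to compare against the written standard.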

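    For ease of displaying here I haven't shown reading from the file. Python's pathlib.Path.read_text() should make it relatively straightforward to get the data from a file; since the documents are usually Latin-1, pass encoding="latin-1" so the bytes are decoded correctly.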
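
Reading the file could look roughly like this. It is a sketch that assumes the document really is Latin-1, as the question suggests; read_document and the file name delivery.txt are hypothetical, and a temporary file stands in for a real document so the snippet runs on its own.

```python
from pathlib import Path
import tempfile


def read_document(path):
    """Return the file content as text, decoded as Latin-1 (an assumption)."""
    return Path(path).read_text(encoding="latin-1")


# Demonstration with a temporary file standing in for a real document
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "delivery.txt"  # hypothetical file name
    sample_path.write_bytes(b"I1234X9876DELIVERY\nH27082022RKG\nPAPPL0010\n")
    content = read_document(sample_path)

lines = content.splitlines()
print(lines)
```

The returned string can be passed straight to parse_data() from the example above.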