Search code examples
pythonalgorithmpython-3.xmorse-code

How can I analyze or improve my niece's simple compression algorithm that is based on Morse code?


My 8 year old niece had a lesson on Morse code in school yesterday, and her assignment was to convert various phrases to Morse code. One of the phrases included her age, and instead of writing ---.., she wrote 3-2. because (in her words), "it's less writing that way." This rudimentary "compression algorithm" sparked my curiosity, so I wrote a bit of code to implement it.

However, we made a few changes along the way. I pointed out to her that if you wrote just .....-----, there isn't any way to tell if the writer meant 50 or eeeeettttt. In reality, there is a pause in between each letter of each word and every word so this isn't a problem, but our scheme didn't have that. I pulled out some graph paper and suggested padding the Morse code for each symbol with another symbol to facilitate coding and remove ambiguity from the scheme. My nice suggested using + because "no one ever writes those in sentences." (Ouch, I recently graduated with a math degree, but fair enough.)

Since some of us do write with +, and we all use hyphens and periods/dots, which would conflict with our standard definition of Morse code, these symbols are replaced with p, h, and d, respectively. Of course, this brings us to the problem of what to do with symbols that aren't defined in our extended Morse code. My niece wanted to simply ignore them, so that's what we did. For the sake of case sensitive preservation of the textual message, upper case letters aren't lower cased in the code; they're just carried through as is and padded with +.

Summary of the algorithm:

  1. Morse codes are right-padded to 5 characters with +
  2. We extended Morse code to substitute p for +, d for ., and h for -.
  3. Symbols that aren't defined in our "extended" Morse code are passed through intact.
  4. Runs of symbols are replaced with unless only one consecutive character occurs, in which case the number is omitted.

Potential pitfalls:

  1. My padding scheme probably reduces the effectiveness of the compression.
  2. Compression might be improved by using blocks larger than 5 characters
  3. If either my niece or I knew anything about compression algorithms, we could probably use that makes them successful.
  4. This obviously isn't suitable for production, but since there are numerous efficient compression algorithms for such purposes, I'm ignoring that problem for the time being.
  5. ???

Example:

In our algorithm, "Hello, World" translates to

H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++

and compresses to

H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+

Here is the Python code I threw together:

#!/usr/bin/python3

import itertools
import string

class MorseString(str):
    def __init__(self, string):
        # or, pad values during iteration but this seems neater
        self._char_morse_map = {"a":".-+++", "b":"-...+", "c":"-.-.+", "d":"-..++", \
                                "e":".++++", "f":"..-.+", "g":"--.++", "h":"....+", \
                                "i":"..+++", "j":".---+", "k":"-.-++", "l":".-..+", \
                                "m":"--+++", "n":"-.+++", "o":"---++", "p":".--.+", \
                                "q":"--.-+", "r":".-.++", "s":"...++", "t":"-++++", \
                                "u":"..-++", "v":"...-+", "w":".--++", "x":"-..-+", \
                                "y":"-.--+", "z":"--..+", "1":".----", "2":"..---", \
                                "3":"...--", "4":"....-", "5":".....", "6":"-....", \
                                "7":"--...", "8":"---..", "9":"----.", "0":"-----",
                                " ":"+++++", ".":"d++++", "+":"p++++", "-":"h++++"}
        
        self._morse_char_map = dict()
        for letter, code in self._char_morse_map.items():
            self._morse_char_map[code] = letter

        self._string = string

        # convert the string to "Morse code". Could also use c.lower()
        self._morse_string = "".join([self._char_morse_map.get(c, c.ljust(5, "+")) for c in self._string])

    def compress(self):
        def grouper(n, k):
            return str(n) + k if n > 1 else k

        # could just use lambda
        return "".join([grouper(len(list(g)), k) for k, g in itertools.groupby(self._morse_string)])

    def decompress(self):
        i = 0
        start = 0
        chars = list()
        sc = self.compress()
        while i < len(sc):
            curr = sc[i]
            i += 1
            if not(curr in string.digits):
                num = 1 if start + 1 == i else int(sc[start:i-1])
                chars.append("".join(curr * num))
                start = i

        code = "".join(chars)
        chars = list()

        for i in range(0, len(code), 5):
            piece = "".join(code[i:i+5])
            chars.append(self._morse_char_map.get(piece, piece[0]))
            
        return "".join(chars)


def main():
    s0 = "Hello, World"
    ms0 = MorseString(s0)
    print(ms0._morse_string)
    print(ms0.compress())
    assert(s0 == ms0.decompress())
    
    s1 = "Hello  2 world."
    ms1 = MorseString(s1)
    assert(s1 == ms1.decompress())

    s2 = "The quick brown fox jumped over the lazy dog."
    ms2 = MorseString(s2)
    assert(s2 == ms2.decompress())

    s3 = "abc  -.+"
    ms3 = MorseString(s3)
    ms3
    assert(s3 == ms3.decompress())

if __name__ == "__main__":
    main()

What are some simple methods that would a) improve our algorithm, and b) be relatively easy to explain to my 8-year old niece? While the last point is clearly subjective, I'm nevertheless trying to indulge her curiosity as much as possible.

I welcome any improvements to the code as well since it's not structured terribly well (I'm fairly certain it's structured quite poorly, actually, but it's quick and dirty), although that's strictly for my benefit since I haven't gotten my niece to use Python (YET).

Update

Here is an updated version of the code, which attempts to incorporate both user1884905's modifications to the algorithm and Karl's improvements to the code itself.

import itertools
import string

_char_morse_map = {"a":".-", "b":"-...", "c":"-.-.", "d":"-..", \
                   "e":".", "f":"..-.", "g":"--.", "h":"....", \
                   "i":"..", "j":".---", "k":"-.-", "l":".-..", \
                   "m":"--", "n":"-.", "o":"---", "p":".--.", \
                   "q":"--.-", "r":".-.", "s":"...", "t":"-", \
                   "u":"..-", "v":"...-", "w":".--", "x":"-..-", \
                   "y":"-.--", "z":"--..", "1":".----", "2":"..---", \
                   "3":"...--", "4":"....-", "5":".....", "6":"-....", \
                   "7":"--...", "8":"---..", "9":"----.", "0":"-----",
                   " ":"", ".":"d", "+":"p", "-":"h"}

_morse_char_map = {
    code: letter
    for letter, code in _char_morse_map.items()
}


def encode(s):
    return "".join(_char_morse_map.get(c, c) + "+" for c in s)

def decode(encoded):
    return "".join(decode_gen(encoded))

def decode_gen(encoded):
    word = ""
    for c in encoded:
        if c != "+":
            word += c
        else:
            yield _morse_char_map.get(word, word) if word != "" else " "
            word = ""

def compress(s):
    def grouper(n, k):
        return str(n) + k if n > 1 else k

    return "".join(grouper(len(list(g)), k) for k, g in itertools.groupby(s))

def decompress(compressed):
    return "".join(decompress_gen(compressed))

def decompress_gen(compressed):
    digits = ""
    for c in compressed:
        if c in string.digits:
            digits += c
        else:
            number = int(digits) if digits else 1
            yield "".join(c * number)
            digits = ""

Solution

  • I would not make a class for this; as is, you re-create the mappings for each string. That could still be avoided with a class, but really it's simpler to just set up those mappings and then write a plain function to encode the string. A "morse code string" is not really fundamentally different from a plain string, and attaching the compression and decompression functions to it doesn't really make any sense. Just write a bunch of functions; worry about OOP when you have a really meaningful abstraction.

    As written, your decompression function makes no sense; compressing is not a part of decompressing, and you've combined decompression of the morse code with decoding back to a plain string in the same function. This is a mess.


    self._morse_char_map = dict()
    for letter, code in self._char_morse_map.items():
        self._morse_char_map[code] = letter
    

    This is more neatly written with a dict comprehension:

    self._morse_char_map = {
        code: letter
        for letter, code in self._char_morse_map.items()
    }
    

    "".join([...])
    

    The square brackets are unnecessary here; just feed a generator expression to join instead, and take advantage of the special syntax rule.


    chars = list()
    

    This is more neatly written chars = [], but let's try to improve this at a higher level...


    while i < len(sc):
        curr = sc[i]
        i += 1
        if not(curr in string.digits):
            num = 1 if start + 1 == i else int(sc[start:i-1])
            chars.append("".join(curr * num))
            start = i
    

    A technique: instead of setting an empty list and repeatedly accumulating things to ''.join together, write a generator and let the result be passed instead. This becomes easier when you have the logic properly separated into its own function:

    def decompress(compressed):
        return ''.join(decompress_gen(compressed))
    
    
    def decompress_gen(compressed):
        start = 0
        i = 0
        while i < len(compressed):
            curr = compressed[i]
            i += 1
            if not(curr in string.digits):
                num = 1 if start + 1 == i else int(compressed[start:i-1])
                yield "".join(curr * num)
                start = i
    

    Now, clearly we really want to just iterate over the compressed symbols with a for loop; manually incrementing an index looks really awful. To make this work, we need to look at the data a character at a time, and remember any parts of a number that we've already seen. We could do the arithmetic as we go, but let's preserve the use of int by instead building up a buffer of characters that are part of the count:

    def decompress_gen(compressed):
        number_digits = ''
        for char in compressed:
            if char in string.digits:
                number_digits += char
            else:
                number = int(number_digits) if number_digits else 1
                yield "".join(char * number)
                number_digits = ''
    

    chars = list()
    
    for i in range(0, len(code), 5):
        piece = "".join(code[i:i+5])
        chars.append(self._morse_char_map.get(piece, piece[0]))
    
    return "".join(chars)
    

    At this point, code is a string, so the ''.join is not needed in creating piece. But again, we could just be using generators (well, generator expressions) here:

    return ''.join(
        self._morse_char_map.get(piece, piece[0])
        for piece in (
            code[i: i + 5]
            for i in range(0, len(code), 5)
        )        
    )