algorithm cryptography processing-efficiency morse-code

Morse without separators - best algorithm

In Morse code there are dots and dashes in groups of 1-4 separated by a separator. Each group means one letter. Between words there are two separators. Between sentences three.

Application for decrypting basic Morse code is quite easy to make. But my question is, how to solve the problem, when there are no separators? I know that there will be a huge amount of nonsense result but that's not my point. I only need to get all possible results in the most efficient way.

This would be an input:

......-...-..---

And this would be one of many outputs:

hello

How would you do that?

Solution

After reading a dit or dah, you have two options: terminate the letter or continue the current letter. This will lead to a lot of bifurcations in your code and a recursive approach might be a good way to implement this.

Keep a buffer of the possible string so far and print (or store) the result when you hit the string end and it coincides with the end of a letter.

Here's an implementation in C:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

static const char *letter = "**ETIANMSURWDKGOHVF*L*PJBXCYZQ**";

void morse_r(char *buf, int len, const char *str, int code)
{
    if (*str == '\0') {
        // end of string; print if staring new letter
        if (code == 1) printf("%.*s\n", len, buf);
    } else if (code < 16) {
        if (*str == '.') code = 2 * code;                    
        if (*str == '-') code = 2 * code + 1;

        // continue letter
        morse_r(buf, len, str + 1, code);

        // start new letter
        buf[len++] = letter[code];
        morse_r(buf, len, str + 1, 1);
    }
}

void morse(const char *str)
{
    char buf[strlen(str)];

    morse_r(buf, 0, str, 1);
}

int main()
{
    morse("......-...-..---");

    return 0;
}

This implementation is very simple. It uses a simplistic lookup mechanism and it doesn't check whether a letter actually exists. (The asterisks in the letter array are valid Morse codes, but they are not Latin letters.)

This approach is also rather brute force: It recalculates the tails over and over. Memoization of the tails will save a lot of extra work for the processor for loner strings.

And, as you are aware, there will be a ot of nonsense results. The above code yields 20569 strings (some of them with asterisks, i.e. invalid). You can prevent many recursions when you do a plausibility or dictionary check on your way. For example, many dots in a row will yield a lot of nonsense words with repeated Es.

Edit: As Jim Mischel points out, an explanation of how the Morse code lookup works is in order. Yves Daoust mentions a trie in his answer. A trie is a tree structure for storing words; each node can have as many children as there are letters in the alphabet. Morse code has only two letters: dit (.) and dah (-). A Morse code trie is therefore a binary tree.

Tries are usually sparse; words are rather long and many letter combinations don't exist. Morse tries are dense: Morse letter encodings are short and nearly every cobmination is used. The tree can be stored as linear, "flat" array, similar to a heap. A node is represented by its index i in the array. The left child is then 2*i and the right child 2*i + 1.

A better and more detailed explanation can be found in an answer I posted to another Morse-related question, from where I've taken the lookup code that I used in the example above.