ruby methods tokenize letters alphabet

Decompose words into letters with Ruby


In my language there are composite (compound) letters, which consist of more than one character, e.g. "ty", "ny" and even "tty" and "nny". I would like to write a Ruby method (spell) which tokenizes words into letters according to this alphabet:

abc=[*%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly q w r t z p l k j h g f d s x c v b n m y}.map{|z| [z,"c"]},*"eéuioöüóőúűáía".split(//).map{|z| [z,"v"]}].to_h

The keys of the resulting hash are the letters and composite letters of the alphabet, and the values show whether each letter is a consonant ("c") or a vowel ("v"), because later I would like to use this hash to decompose words into syllables. Of course, the method doesn't need to resolve cases in compound words where a composite letter is accidentally formed at the boundary between the two words.

Examples:

spell("csobolyó") => [ "cs", "o", "b", "o", "ly", "ó" ]
spell("nyirettyű") => [ "ny", "i", "r", "e", "tty", "ű" ]
spell("dzsesszmuzsikus") => [ "dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s" ]

Solution

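  • You might be able to get started by looking at String#scan, which appears to give decent results for your examples. Regexp.union tries the alternatives in the order given, so the longer composite letters listed first in abc take precedence over their component letters: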

    "csobolyó".scan(Regexp.union(abc.keys))
    # => ["cs", "o", "b", "o", "ly", "ó"]
    "nyirettyű".scan(Regexp.union(abc.keys))
    # => ["ny", "i", "r", "e", "tty", "ű"]
    "dzsesszmuzsikus".scan(Regexp.union(abc.keys))
    # => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]
    

    The last case doesn't match your expected output, but it matches your statement in the comments:

    I sorted the letters in the alphabet: if a letter appears earlier, then it should be recognized instead of its simple letters. When a word contains "dzs" it should be treated as "dzs" and not as "d" and "zs".
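
The scan approach above could be wrapped into the spell method the question asks for. The following is only a minimal sketch, not a definitive implementation: the constant names (CONSONANTS, VOWELS, ABC, LETTER) and the classify helper are made up for illustration, and it assumes lowercase input made up only of letters from the alphabet.

```ruby
# Alphabet from the question. Composite letters are listed first so that
# they are tried before the simple letters they are made of.
CONSONANTS = %w[tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly
                q w r t z p l k j h g f d s x c v b n m y].freeze
VOWELS = "eéuioöüóőúűáía".chars.freeze

# Same shape as the abc hash in the question: letter => "c" or "v".
ABC = [*CONSONANTS.map { |l| [l, "c"] }, *VOWELS.map { |l| [l, "v"] }].to_h.freeze

# Regexp.union keeps the order of the keys, and Ruby tries alternatives
# left to right, so "dzs" is tried before "dz", "zs", "d", "z" and "s".
LETTER = Regexp.union(ABC.keys)

# Splits a word into the letters of the alphabet.
def spell(word)
  word.scan(LETTER)
end

# Hypothetical helper: pairs each letter with its "c"/"v" class,
# which may be a starting point for the later syllable step.
def classify(word)
  spell(word).map { |letter| [letter, ABC[letter]] }
end

p spell("dzsesszmuzsikus")
p classify("csobolyó")
```

Note that scan silently skips any character that matches none of the alternatives, so uppercase letters, hyphens, or spaces would simply disappear from the result; downcasing or validating the input first may be needed, depending on the use case.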