I'm implementing a readability test and have written a simple algorithm for detecting syllables. I count the sequences of vowels in each word; for example, the word "should" contains one vowel sequence, "ou". Before counting, I remove suffixes like -les, -e, -ed (for example, the word "like" has one syllable but two vowel sequences, so stripping the suffix first makes the method work).
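For reference, the approach described above could be sketched roughly like this. No language is named in the question, so Python is an assumption here, as are the function name `count_syllables`, the exact suffix list, and the choice to treat "y" as a vowel.

```python
import re

# Suffixes stripped before counting. The list is illustrative only,
# taken from the examples mentioned in the text (-les, -ed, -e).
SUFFIXES = ("les", "ed", "e")


def count_syllables(word: str) -> int:
    """Estimate syllables as the number of vowel sequences in the word."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Strip at most one suffix, and only if at least two letters remain.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            w = w[: -len(suffix)]
            break
    # Each maximal run of vowels counts as one syllable ("y" included).
    groups = re.findall(r"[aeiouy]+", w)
    # Every word is assumed to have at least one syllable.
    return max(1, len(groups))


print(count_syllables("should"))  # 1
print(count_syllables("like"))    # 1
```

This is only a baseline; it deliberately ignores apostrophes, hyphens and digits.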
But... Consider these words / sequences:
What should I do with special characters? Remove them all? That would be fine for most words, but not for "n'" and "x-ray". And how should I treat digits?
These are special cases, but I'd be very glad to hear about any experience or ideas on this subject.
I'd advise you to first determine how much of your data consists of these kinds of words and how much it matters to your program's overall performance. Also compile some statistics of which kinds occur most.
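A quick way to gather such statistics might look like the sketch below. The category names and the helpers `classify` and `special_case_stats` are made up for illustration, and the whitespace tokenization is deliberately naive.

```python
import re
from collections import Counter

# Punctuation trimmed from token edges; apostrophes and hyphens are
# kept on purpose, since they are what we want to count.
EDGE_PUNCTUATION = '.,;:!?"()[]'


def classify(token: str) -> str:
    """Assign a token to one rough category of special case."""
    if re.search(r"\d", token):
        return "digits"
    if "'" in token:
        return "apostrophe"
    if "-" in token:
        return "hyphen"
    if token.isalpha():
        return "plain"
    return "other"


def special_case_stats(text: str) -> Counter:
    """Count how many tokens fall into each category."""
    tokens = (t.strip(EDGE_PUNCTUATION) for t in text.split())
    return Counter(classify(t) for t in tokens if t)


stats = special_case_stats("I shouldn't x-ray the 3rd one, goin' home.")
print(stats.most_common())
```

If the "plain" bucket dominates on your real data, the special cases may not be worth much effort.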
There's no simple correct solution for this problem, but I can suggest a few heuristics:
* An apostrophe between two consonants ("shouldn't") seems to mark the elision of a syllable.
* An apostrophe with a vowel or word boundary on one side ("I'd", "goin'") seems not to do so (but note that "goin'" is still two syllables).
* "n'" is at least one syllable long.
* A hyphen ("-") may be handled by treating the text on both sides as separate words.
* "3rd" can be handled by code that writes ordinals out as words, or by simpler heuristics.
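Put together, the heuristics above might be sketched as follows. Everything here is an assumption for illustration: the names `base_count` and `count_special`, the tiny `ORDINALS` table, and in particular my reading of the first heuristic as "an apostrophe between two consonants stands for a dropped vowel, so add one syllable". The base counter is a bare vowel-sequence count without suffix stripping, to keep the example short.

```python
import re

# Any letter that is not a vowel ("y" is treated as a vowel here).
CONSONANT = "[b-df-hj-np-tv-xz]"

# Tiny illustrative table; a real one would cover far more ordinals.
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third"}


def base_count(word: str) -> int:
    """Plain vowel-sequence count, with a minimum of one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def count_special(token: str) -> int:
    """Syllable estimate that applies the special-case heuristics."""
    t = token.lower()
    if "-" in t:
        # Hyphen: treat the text on both sides as separate words.
        return sum(count_special(part) for part in t.split("-") if part)
    if t in ORDINALS:
        # Ordinal written with digits: count its spelled-out form.
        return base_count(ORDINALS[t])
    if t == "n'":
        # "n'" is at least one syllable long.
        return 1
    # Apostrophe between two consonants: assume one elided vowel each.
    elided = len(re.findall(f"(?<={CONSONANT})'(?={CONSONANT})", t))
    # Other apostrophes (next to a vowel or a word boundary) add nothing.
    return base_count(t.replace("'", "")) + elided


for word in ["shouldn't", "I'd", "x-ray", "3rd"]:
    print(word, count_special(word))
```

Note that "goin'" still comes out as one syllable with this sketch, because "oi" is a single vowel sequence; that is a limitation of the base counter rather than of the apostrophe rule.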