parsing, unicode, character-encoding, string-parsing, punctuation

Are there character collections for all international full-stop punctuation marks?


I am trying to parse UTF-8 strings into "bite-sized" segments. For example, I would like to break a text down into "sentences".

Is there a comprehensive collection of characters (or a regex) that corresponds to sentence endings in all languages? I'm looking for something that would capture the Latin period, exclamation mark and question mark, the Chinese and Japanese full stop, etc.

Something like the above but for the equivalent of a comma would be great too.


Solution

  • I haven’t encountered any compilations of such information, and I would expect collecting it to be a major effort. For some widely used languages, you could get the information from The Chicago Manual of Style. There is some information about punctuation marks commonly used in different languages at http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters-other.html, but it covers just a small set of languages and does not distinguish sentence-terminating characters.

    Using just characters won’t be enough, since e.g. in English, the full stop “.” occurs in many contexts where it does not terminate a sentence, as in “e.g.” or in “1.5”.
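
To make the last point concrete, here is a minimal Python sketch of a purely character-based splitter. The `TERMINATORS` set and the `naive_split` helper are hypothetical names introduced for illustration, and the character list is hand-picked and deliberately incomplete, not an authoritative collection. It behaves reasonably on simple input but breaks “1.5” and “e.g.” apart, which is exactly the problem described above.

```python
import re

# Hand-picked, incomplete set of sentence-ending characters (an assumption
# for illustration): Latin . ! ?, ideographic full stop, fullwidth ! and ?
TERMINATORS = ".!?\u3002\uff01\uff1f"

_CLASS = re.escape(TERMINATORS)
# A run of non-terminators followed by any terminators that directly follow it.
_SEGMENT = re.compile(rf"[^{_CLASS}]+[{_CLASS}]*")


def naive_split(text: str) -> list[str]:
    """Cut the text after every terminator character, ignoring context."""
    pieces = (match.group().strip() for match in _SEGMENT.finditer(text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    # Simple input is segmented as one would hope.
    print(naive_split("Hello world! How are you?"))
    # Abbreviations and decimals are wrongly cut at every full stop.
    print(naive_split("It costs 1.5 dollars, e.g. at the shop."))
```

The second call yields four fragments instead of one sentence, so any practical segmenter needs context rules (abbreviation lists, digit checks, language-specific conventions) on top of a character set.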