Elixir/Erlang - Split paragraph into sentences based on the language


In Java there is a class called BreakIterator that takes a paragraph of text in any language (the language is known in advance) and splits it into separate sentences. The magic is that it accepts the locale of the language the text is written in and splits the text according to that language's rules. If you look into it, this is actually a very complex problem even in English; it is certainly not a case of 'split on full stops/periods'.
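
For reference, the Java behaviour described above can be sketched roughly as follows. This is a minimal illustration using the standard java.text.BreakIterator API. The class name SentenceSplitter and the helper split are hypothetical names chosen for this example, and trimming the whitespace around each sentence is an assumption about the desired output.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named for illustration only.
public class SentenceSplitter {

    // Splits text into sentences using the sentence-break rules of the given locale.
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk the boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Assumption: surrounding whitespace is not wanted in the result.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello there. How are you? Fine!", Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

The locale is the only language-specific input; the same call with a different Locale picks up that language's break rules, which is the behaviour I am trying to reproduce in Elixir.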

Does anybody know how I would do this in Elixir? I couldn't find anything with a Google search.

I am almost at the point of deploying a very thin public API that does only this one task, which I could then call from Elixir, but that is really not desirable.

Any help would be really appreciated.


Solution

  • The i18n library should be usable for this. I have no experience using it, so going only from the examples it provides, something like the following should work (:en is the locale code):

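    # Convert the text into the library's own string representation
    str = :i18n_string.from("some string")
    # Open a sentence-break iterator for the :en locale
    iter = :i18n_iterator.open(:en, :sentence)
    # Split the string at the boundaries the iterator finds
    sentences = :i18n_string.split(iter, str)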
    

    There's also Cldr, which implements a lot of locale-dependent Unicode algorithms directly in Elixir, but it doesn't seem to include break iteration at the moment (you may want to raise an issue there).