algorithm, nlp, text-segmentation

Word splitting statistical approach


I want to solve the word splitting problem (parsing words out of a long string with no spaces). For example, we want to split somelongword into [some, long, word].

We can achieve this with a dictionary-based dynamic programming approach, but another issue we run into is parsing ambiguity, e.g. orcore => or core, or orc ore. (We don't take phrase meaning or part of speech into account.) So I am thinking about using some statistical or ML approach.
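To make the ambiguity concrete, here is a minimal sketch of a dictionary-based splitter that enumerates every valid split. The tiny `DICTIONARY` and the function name `all_segmentations` are made up for illustration; a real word list would be much larger.

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real one would be loaded from a word list.
DICTIONARY = frozenset({"or", "core", "orc", "ore", "some", "long", "word"})


@lru_cache(maxsize=None)
def all_segmentations(text):
    """Return every way to split `text` into dictionary words, as tuples."""
    if not text:
        return ((),)  # exactly one way to split the empty string: no words
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in DICTIONARY:
            # Memoization means each suffix is only solved once.
            for rest in all_segmentations(text[i:]):
                results.append((head,) + rest)
    return tuple(results)


print(all_segmentations("orcore"))        # (('or', 'core'), ('orc', 'ore'))
print(all_segmentations("somelongword"))
```

The dictionary alone cannot say which of the two orcore splits is better, which is why some kind of scoring is needed on top of it.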

I found that Naive Bayes and the Viterbi algorithm, together with a training set, can be used to solve this. Can you point me to some information about applying these algorithms to the word splitting problem?

UPD: I've implemented this method in Clojure, using some advice from Peter Norvig's code.


Solution

  • I think that the slideshow by Peter Norvig and Sebastian Thrun is a good starting point. It presents real-world work done by Google.
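As a rough illustration of the statistical idea (not code taken from the slideshow), here is a minimal sketch of Viterbi-style dynamic programming over a unigram word model. The `COUNTS` table is invented toy data standing in for counts from a real training corpus, and the penalty for unknown words and the `MAX_WORD_LEN` limit are assumptions as well.

```python
import math

# Hypothetical toy counts standing in for a real training corpus.
COUNTS = {"or": 50, "core": 20, "orc": 1, "ore": 2,
          "some": 30, "long": 25, "word": 15}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def log_prob(word):
    """Log unigram probability; unseen strings get a length-based penalty."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed smoothing: the longer an unseen string, the less likely it is.
    return -math.log(TOTAL) - len(word) * math.log(10)


def segment(text):
    """Return the most probable split of `text` under the unigram model."""
    # best[i] = (best log probability of text[:i], start index of last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("orcore"))
print(segment("somelongword"))
```

With these toy counts, or core beats orc ore simply because "or" and "core" are more frequent. A bigram model over the same dynamic program would let the surrounding words influence the choice too.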