c++ boost latin1

Tokenize Latin-1 text in C++


I have a MySQL table containing Latin-1 text, and I am trying to tokenize this text into words.

I came across the Boost and ICU tokenizers. The problem is that these libraries expect me to figure out the word boundaries myself.

I tried the following Boost code (with the default tokenizer, i.e. splitting on spaces and punctuation).

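#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main(){

   using namespace std;
   using namespace boost;

   string s = "Tänk efter nu – förr'n vi föser dig bort";
   tokenizer<> tok(s);   // default: split on whitespace and punctuation

   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }

   return 0;
}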

It does give me a list of words, but here I am assuming that a space is the correct word separator.

Considering the set of languages listed at http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Languages_with_complete_coverage, is it safe to use the above code?

Or would you recommend another solution?


Solution

  • ICU supports boundary analysis that takes the characteristics of the text's language into account:

    http://userguide.icu-project.org/boundaryanalysis
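
To show what the "space and punctuation" assumption from the question amounts to at the byte level, here is a minimal sketch that needs neither Boost nor ICU. It assumes the input really is raw ISO-8859-1 (one byte per character, which is not the case if the string comes from a UTF-8 source file or connection). The names isLatin1WordByte and tokenizeLatin1 are hypothetical helpers for illustration, not part of any library.

```cpp
#include <string>
#include <vector>

// Assumption: text is raw ISO-8859-1, one byte per character.
// Returns true for bytes that ISO-8859-1 assigns to letters or ASCII digits.
bool isLatin1WordByte(unsigned char c) {
    if (c >= '0' && c <= '9') return true;
    if (c >= 'A' && c <= 'Z') return true;
    if (c >= 'a' && c <= 'z') return true;
    if (c == 0xAA || c == 0xB5 || c == 0xBA) return true; // ª, µ, º
    // 0xC0..0xFF are accented letters, except 0xD7 (×) and 0xF7 (÷).
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

// Hypothetical helper: every byte that is not a word byte ends the current token.
std::vector<std::string> tokenizeLatin1(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (isLatin1WordByte(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Calling tokenizeLatin1 on the Latin-1 bytes of "Tänk efter nu, förr'n vi" keeps the accented words intact but splits "förr'n" at the apostrophe into "förr" and "n". Whether that split is right depends on the language, which is the kind of decision rule-based boundary analysis such as ICU's is meant to handle.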