Tags: text, language-agnostic, weka, tokenize

How to mix Weka Tokenizer results


I have texts in English and Spanish, and I want to tokenize each language separately using Weka and then merge both results into a single output.

If, for example, I simply copy the English attributes followed by the Spanish ones (and concatenate the data generated by the two runs in the same way), the attribute indices in the Spanish data end up pointing at English attributes.

If I mix the texts before tokenizing, I don't know how many attributes will be generated for each language (and I want the same number of attributes per language).

Is there any way in Weka to merge both results into the same output while keeping the same number of attributes per language? Or is there a way to configure the tokenizer's dictionary so that it uses one I supply?
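For concreteness, here is a minimal sketch of the separate, per-language tokenization described above; the file names english.arff and spanish.arff, the class name SeparateTokenization, and the assumption of a single string attribute plus a nominal class in the last position are illustrative only.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SeparateTokenization {

    // Tokenizes one dataset on its own, so each language ends up with its own
    // token attribute set and its own attribute indices.
    static Instances tokenize(String arffPath) throws Exception {
        Instances raw = DataSource.read(arffPath);
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector bow = new StringToWordVector();
        bow.setTokenizer(new WordTokenizer());
        bow.setLowerCaseTokens(true);
        bow.setInputFormat(raw);
        return Filter.useFilter(raw, bow);
    }

    public static void main(String[] args) throws Exception {
        Instances english = tokenize("english.arff");
        Instances spanish = tokenize("spanish.arff");

        // The two headers are incompatible: attribute i is a different token
        // in each dataset, so concatenating the data sections directly would
        // make the Spanish indices point at English attributes.
        System.out.println("English attributes: " + english.numAttributes());
        System.out.println("Spanish attributes: " + spanish.numAttributes());
    }
}
```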

Thanks in advance.


Solution

  • You could build a hierarchical model.

    On level 1, build a separate, independent model for each language, using whatever tokens each one produces (they will differ). Have each model output its prediction probabilities for your final classes; this gives you a mapping from texts (in either language) to a common set of classes (common to the final task, though they could also serve as intermediate features).

    Use these common classes to build a level-2 model that maps the level-1 predictions to your final classes (see the sketch below).
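A minimal sketch of this idea (not the only way to implement it), assuming two ARFF files english.arff and spanish.arff that each contain a text string attribute and the same nominal class attribute in the last position; the class name TwoLevelTextModel and the choices of WordTokenizer, NaiveBayes for level 1, and Logistic for level 2 are arbitrary placeholders:

```java
import java.util.ArrayList;

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TwoLevelTextModel {

    // Level 1: tokenization + classifier bundled together, so each language
    // keeps its own independent token attribute space.
    static FilteredClassifier buildLevel1(Instances data) throws Exception {
        StringToWordVector bow = new StringToWordVector();
        bow.setTokenizer(new WordTokenizer());
        bow.setLowerCaseTokens(true);
        bow.setWordsToKeep(1000);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(bow);
        fc.setClassifier(new NaiveBayes());   // any base classifier works here
        fc.buildClassifier(data);
        return fc;
    }

    // Maps each text to its level-1 class probabilities and appends the result
    // to the meta dataset (one numeric attribute per class, plus the class).
    static void addMetaInstances(Instances raw, FilteredClassifier model, Instances meta) throws Exception {
        for (int i = 0; i < raw.numInstances(); i++) {
            double[] probs = model.distributionForInstance(raw.instance(i));
            double[] vals = new double[meta.numAttributes()];
            System.arraycopy(probs, 0, vals, 0, probs.length);
            // assumes both ARFF files declare the same class labels in the same order
            vals[meta.numAttributes() - 1] = raw.instance(i).classValue();
            meta.add(new DenseInstance(1.0, vals));
        }
    }

    public static void main(String[] args) throws Exception {
        Instances en = DataSource.read("english.arff");
        Instances es = DataSource.read("spanish.arff");
        en.setClassIndex(en.numAttributes() - 1);
        es.setClassIndex(es.numAttributes() - 1);

        // Level 1: one independent model per language.
        FilteredClassifier enModel = buildLevel1(en);
        FilteredClassifier esModel = buildLevel1(es);

        // Common meta attributes: one probability per final class, plus the class itself.
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (int c = 0; c < en.classAttribute().numValues(); c++) {
            attrs.add(new Attribute("prob_" + en.classAttribute().value(c)));
        }
        attrs.add((Attribute) en.classAttribute().copy());
        Instances meta = new Instances("meta", attrs, 0);
        meta.setClassIndex(meta.numAttributes() - 1);

        // Texts of both languages land in the same class-probability space.
        addMetaInstances(en, enModel, meta);
        addMetaInstances(es, esModel, meta);

        // Level 2: learn the final mapping from probabilities to classes.
        Logistic level2 = new Logistic();
        level2.buildClassifier(meta);
        System.out.println(level2);
    }
}
```

Because the level-2 features are class probabilities rather than tokens, both languages contribute the same fixed number of attributes, which sidesteps the vocabulary mismatch. In practice you would want the level-1 probabilities used for level-2 training to come from cross-validation rather than from models evaluated on their own training data, otherwise the level-2 model sees over-optimistic inputs.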