ios, objective-c, cocoa, nlp, linguistics

Linguistic tagger incorrectly tagging as 'OtherWord'


I've been using NSLinguisticTagger with sentences and have been encountering a strange issue with sentences such as 'I am hungry' or 'I am drunk'. Whilst one would expect 'I' to be tagged as a pronoun, 'am' as a verb and 'hungry' as an adjective, they are not. Rather they are all tagged as OtherWord.

Is there something I'm doing incorrectly?

NSString *input = @"I am hungry";
NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace;
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"] options:options];
tagger.string = input;

[tagger enumerateTagsInRange:NSMakeRange(0, input.length) scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass options:options usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
    NSString *token = [input substringWithRange:tokenRange];
    NSString *lemma = [tagger tagAtIndex:tokenRange.location
                                  scheme:NSLinguisticTagSchemeLemma
                              tokenRange:NULL
                           sentenceRange:NULL];
    NSLog(@"%@ (%@) : %@\n", token, lemma, tag);
}];

And the output is:

I ((null)) : OtherWord
am ((null)) : OtherWord
hungry ((null)) : OtherWord

Solution

  • After quite some time in chat we found the issue:

    The sentence does not contain enough information to determine its language.

    To fix this you can either:

    Append a sample sentence in your language of choice after your actual sentence. The extra text should give the tagger enough material to detect your preferred language.

    OR

    Tell the tagger what language to use: add the line

    [tagger setOrthography:[NSOrthography orthographyWithDominantScript:@"Latn" languageMap:@{@"Latn" : @[@"en"]}] range:NSMakeRange(0, input.length)];
    

    before the enumerate call. That way you explicitly tell the tagger which language the text is in, in this case English (en) written in the Latin script (Latn).

    If you don't know the language for sure, it may be useful to apply either of these methods only as a fallback, when the words get tagged as OtherWord, which means the language could not be detected.
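
The fallback idea could be sketched roughly as follows. This is only an illustration, not a tested drop-in: the helper names TagsForString, AllOtherWord and TagsWithEnglishFallback are made up for this example, and it assumes that "every word token is OtherWord" is a good enough signal that language detection failed. Punctuation is omitted so that a trailing full stop does not hide the problem.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the lexical-class tag of every word token in `input`.
// When `forceEnglish` is YES, the orthography is set explicitly before tagging.
static NSArray<NSString *> *TagsForString(NSString *input, BOOL forceEnglish) {
    NSLinguisticTaggerOptions options =
        NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
        initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"]
                   options:options];
    tagger.string = input;
    NSRange fullRange = NSMakeRange(0, input.length);

    if (forceEnglish) {
        // Same call as in the answer: English text in the Latin script.
        NSOrthography *english =
            [NSOrthography orthographyWithDominantScript:@"Latn"
                                             languageMap:@{@"Latn" : @[@"en"]}];
        [tagger setOrthography:english range:fullRange];
    }

    NSMutableArray<NSString *> *tags = [NSMutableArray array];
    [tagger enumerateTagsInRange:fullRange
                          scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        // Treat a missing tag like OtherWord so the check below stays simple.
        [tags addObject:tag ?: NSLinguisticTagOtherWord];
    }];
    return [tags copy];
}

// Hypothetical helper: YES if every tag in the array is OtherWord.
static BOOL AllOtherWord(NSArray<NSString *> *tags) {
    for (NSString *tag in tags) {
        if (![tag isEqualToString:NSLinguisticTagOtherWord]) {
            return NO;
        }
    }
    return YES;
}

// Lets the tagger detect the language first; forces English only if that fails
// (assumption: "all tokens are OtherWord" means detection failed).
static NSArray<NSString *> *TagsWithEnglishFallback(NSString *input) {
    NSArray<NSString *> *tags = TagsForString(input, NO);
    if (tags.count > 0 && AllOtherWord(tags)) {
        tags = TagsForString(input, YES);
    }
    return tags;
}
```

Calling TagsWithEnglishFallback(@"I am hungry") would then only set the orthography when the first pass produced nothing but OtherWord. A fresh tagger is created for the second pass so the sketch does not depend on whether an existing tagger re-tags after its orthography changes.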