Using the NLTagger class, I'm wondering if anyone can recommend the most straightforward way to enumerate the tagged tokens in a given text while pulling out multiple tag types per token. For example, I'd like to enumerate the words in a given text and pull out (lemma, lexical category) for each.
It seems that the enumerateTags() method and associated NLTag class have the limitation of only reporting one particular tag type per enumeration. So I can achieve what I want by making multiple passes over the text, e.g. pulling out the string ranges that match given criteria on the first pass and then matching things up on later passes. For example, I could lemmatise all of the nouns and verbs like this:
import NaturalLanguage
let text = "some text" // replace with the text to analyse
let tagger = NLTagger(tagSchemes: [.lemma, .nameTypeOrLexicalClass])
tagger.string = text
let keyWordCategories: [NLTag] = [.noun, .verb]
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]
// In the first pass, we're going to record which ranges are of categories we're interested in
var keywordRanges = Set<Range<String.Index>>(minimumCapacity: 200)
// The lemmas collected in the second pass
var lemmas = Set<String>()
// First pass: which are the nouns and verbs?
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange in
    if let tag = tag, keyWordCategories.contains(tag) {
        keywordRanges.insert(tokenRange)
    }
    return true
}
// Second pass: lemmatise, filtering on just the nouns and verbs
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma, options: options) { tag, tokenRange in
    if let tag = tag, keywordRanges.contains(tokenRange) {
        lemmas.insert(tag.rawValue)
    }
    return true
}
This mechanism achieves the desired functionality, but strikes me as a somewhat clumsy and potentially inefficient way to have to go about things. I would have expected to be able to enumerate (lemma, lexical category) in a single pass. I'm assuming that the NLTagger instance caches things behind the scenes so that it's not as terrible as it looks in terms of efficiency. But it's still far from ideal in terms of simplicity of the code. Can anyone more familiar with this API advise on whether this is really the intended pattern?
You could use tags(in:unit:scheme:options:) to look up the lemma for a specific token range, instead of making a second pass over every lemma in the text:
// `text` is the string to analyse, as in the question
let tagger = NLTagger(tagSchemes: [.lemma, .nameTypeOrLexicalClass])
tagger.string = text
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]
let keyWordCategories: Set<NLTag> = [.noun, .verb]
var lemmas = Set<String>()
let unit: NLTokenUnit = .word
// Single enumeration over the lexical classes
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: unit, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange in
    if let tag = tag, keyWordCategories.contains(tag) {
        // Ask the same tagger for the lemma of just this token's range
        if let lemma = tagger.tags(in: tokenRange, unit: unit, scheme: .lemma, options: options).first?.0?.rawValue {
            lemmas.insert(lemma)
        }
    }
    return true
}
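If you want the (lemma, lexical category) pair for every word rather than just a set of lemmas, the same idea can be wrapped in a small helper. This is only a minimal sketch: `TaggedWord` and `taggedWords(in:)` are hypothetical names of my own, not part of NaturalLanguage, and it uses `tag(at:unit:scheme:)` to look up the lemma at the start of each token. I've left out `.joinNames` here because `tag(at:unit:scheme:)` takes no options, so for a joined multi-word name it would presumably only describe the first word.

```swift
import NaturalLanguage

// Hypothetical container for one word and both of its tags.
struct TaggedWord {
    let token: String
    let lexicalClass: NLTag?
    let lemma: String?
}

// Hypothetical helper: enumerates the lexical classes once and, for each
// token, asks the same tagger for the lemma at the token's start index.
func taggedWords(in text: String) -> [TaggedWord] {
    let tagger = NLTagger(tagSchemes: [.lemma, .nameTypeOrLexicalClass])
    tagger.string = text
    let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
    var words: [TaggedWord] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange in
        // tag(at:unit:scheme:) returns the tag plus the range it covers
        let (lemmaTag, _) = tagger.tag(at: tokenRange.lowerBound, unit: .word, scheme: .lemma)
        words.append(TaggedWord(token: String(text[tokenRange]),
                                lexicalClass: tag,
                                lemma: lemmaTag?.rawValue))
        return true
    }
    return words
}

// Usage: print each word with its lexical class and lemma ("-" when missing).
for word in taggedWords(in: "The cats were sleeping on the sofa.") {
    print(word.token, word.lexicalClass?.rawValue ?? "-", word.lemma ?? "-")
}
```

The lemma can come back as nil for some tokens, which is why both tags are kept optional; filtering on `.noun` and `.verb` as in the snippet above is then just a matter of checking `lexicalClass` on the returned values.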