After using NameFinderME to find the names in a series of tokens, I would like to reverse the tokenization and reconstruct the original text with the names modified. Is there a way to reverse the tokenization exactly as it was performed, so that the output has the same structure as the input?
Example
Hello my name is John. This is another sentence.
Find sentences
Hello my name is John.
This is another sentence.
Tokenize sentences.
> Hello
> my
> name
> is
> John.
>
> This
> is
> another
> sentence.
My code that analyzes the tokens above looks something like this so far.
TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
NameFinderME nameFinder = new NameFinderME(model3);
List<Span[]> spans = new List<Span[]>();
foreach (string sentence in sentences)
{
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = nameFinder.find(tokens);
    string[] namedEntities = Span.spansToStrings(nameSpans, tokens);
    //I want to modify each of the named entities found
    //foreach(string s in namedEntities) { modifystring(s) };
    spans.Add(nameSpans);
}
Desired output, perhaps masking the names that were found.
Hello my name is XXXX. This is another sentence.
In the documentation, there is a link to this post describing how to use the detokenizer. I don't understand how the operations array relates to the original tokenization (if at all).
https://issues.apache.org/jira/browse/OPENNLP-216
Create an instance of SimpleTokenizer.
String sentence = "He said \"This is a test\".";
SimpleTokenizer instance = SimpleTokenizer.INSTANCE;
Tokenize the sentence using the tokenize(String str) method of SimpleTokenizer.
String tokens[] = instance.tokenize(sentence);
The operations array must contain one operation per entry in the tokens array, so the two arrays must have the same length.
Store the operation name tokens.length times in the operations array.
Operation operations[] = new Operation[tokens.length];
String oper = "MOVE_RIGHT"; // please refer above list for the list of operations
for (int i = 0; i < tokens.length; i++) {
    operations[i] = Operation.parse(oper);
}
System.out.println(operations.length);
Here the operation array length will be equal to the tokens array length.
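As a rough illustration of what these operations express: as far as I can tell from the constructor, the tokens and operations arrays form a lookup table (token string to operation), so they describe how a given token should attach to its neighbours rather than recording how a particular sentence was split. The sketch below is a toy model of that idea in plain Java. ToyDetokenizer and its Op enum are hypothetical stand-ins, not OpenNLP classes, and the real DictionaryDetokenizer may behave differently in detail.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a dictionary-driven detokenizer (hypothetical, not OpenNLP code). */
final class ToyDetokenizer {

    /** Stand-ins for the operation names mentioned in the post. */
    enum Op { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH }

    // Lookup table: token string -> how it attaches to its neighbours.
    private final Map<String, Op> rules = new HashMap<>();

    ToyDetokenizer(String[] entries, Op[] ops) {
        if (entries.length != ops.length) {
            throw new IllegalArgumentException("entries and ops must have equal length");
        }
        for (int i = 0; i < entries.length; i++) {
            rules.put(entries[i], ops[i]);
        }
    }

    /** Joins tokens with single spaces, except where a rule glues neighbours together. */
    String detokenize(String[] tokens) {
        StringBuilder out = new StringBuilder();
        boolean glueNext = false; // true if the previous token attaches to its right neighbour
        for (int i = 0; i < tokens.length; i++) {
            Op op = rules.get(tokens[i]); // null means "no rule for this token"
            boolean glueLeft = op == Op.MOVE_LEFT || op == Op.MOVE_BOTH;
            if (i > 0 && !glueNext && !glueLeft) {
                out.append(' ');
            }
            out.append(tokens[i]);
            glueNext = op == Op.MOVE_RIGHT || op == Op.MOVE_BOTH;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyDetokenizer punct = new ToyDetokenizer(
                new String[] {".", "(", ")"},
                new Op[] {Op.MOVE_LEFT, Op.MOVE_RIGHT, Op.MOVE_LEFT});
        // prints "My name is John."
        System.out.println(punct.detokenize(new String[] {"My", "name", "is", "John", "."}));

        // Registering every token as MOVE_RIGHT glues everything together in this toy.
        String[] all = {"He", "said"};
        ToyDetokenizer everything = new ToyDetokenizer(all, new Op[] {Op.MOVE_RIGHT, Op.MOVE_RIGHT});
        // prints "Hesaid"
        System.out.println(everything.detokenize(all));
    }
}
```

In this toy model, the rules only decide where a single space is inserted or omitted. Runs of spaces, tabs, or line breaks in the input cannot be recovered from the tokens alone, which is why a rule table like this is not an exact inverse of the tokenizer.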
Now create an instance of DetokenizationDictionary by passing tokens and operations arrays to the constructor.
DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);
Pass the DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens.
DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);
DictionaryDetokenizer.detokenize requires two parameters: (a) the tokens array and (b) the split marker.
String st = dictDetokenize.detokenize(tokens, " ");
Output:
Use the Detokenizer.
String text = detokenize(myTokens, null);
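If the goal is output that matches the input character for character, a different technique may be simpler than detokenizing: keep the original string and replace only the character ranges of the names. OpenNLP tokenizers offer tokenizePos(String), which returns a Span[] of character offsets, and the spans returned by NameFinderME.find are token indices, so the two can be combined. The sketch below shows the idea in plain Java so that it runs without OpenNLP. OffsetMasker, Range, and the toy tokenizer are hypothetical, and the "name" at token index 4 is hard-coded where a real name finder would supply it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: masks tokens in place using character offsets. */
final class OffsetMasker {

    /** Half-open character range [start, end) in the original text. */
    static final class Range {
        final int start;
        final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    /** Toy tokenizer: letter/digit runs are tokens, any other non-space char is its own token. */
    static List<Range> tokenizePositions(String text) {
        List<Range> ranges = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (Character.isLetterOrDigit(c)) {
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
            } else {
                i++; // punctuation and other symbols stand alone
            }
            ranges.add(new Range(start, i));
        }
        return ranges;
    }

    /** Replaces each range with the mask; all other characters are copied verbatim. */
    static String mask(String text, List<Range> toMask, String mask) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Range r : toMask) { // assumes ranges are sorted and do not overlap
            out.append(text, cursor, r.start);
            out.append(mask);
            cursor = r.end;
        }
        out.append(text.substring(cursor));
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Hello my name is John. This is another sentence.";
        List<Range> tokens = tokenizePositions(text);
        // Assumption: a name finder reported the token at index 4 ("John") as a name.
        List<Range> names = new ArrayList<>();
        names.add(tokens.get(4));
        // prints "Hello my name is XXXX. This is another sentence."
        System.out.println(mask(text, names, "XXXX"));
    }
}
```

With OpenNLP itself, the character range of a name would presumably run from the start offset of the token at the name span's start index to the end offset of the token at its end index minus one, since span ends are exclusive. Because the untouched parts of the string are copied as they are, spacing and punctuation survive unchanged.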