I would like to build an NLP classification model. My input is a paragraph or a sentence. Ideally, my output is a score or probability (between 0 and 1).
I have defined specific entities ex ante, and each entity belongs to a single group.
Based on business insights, we know that the output to predict does not depend on the entities themselves, but on their groups. For example, the phrase “Max barks” would return 1 because “Max” belongs to the group “Dogs”, but “Kitty barks” would return 0 (because Kitty is not a dog). If “Max” were a cat, the phrase would return 0. One way to do this would be to generate all the sentences with every combination of dog and cat names (in my example), but that is very cumbersome! Another way would be to replace the entity with the name of its group (“Max” becomes “Dogs”, for example), but that looks weird to me!
I don't have any other idea how to tackle this problem.
Could you please help me, ideally with code?
Thanks a lot.
If I understand your question correctly, you want to classify the text into "dog activities" vs. "non-dog activities". The text refers to dogs, cats (and maybe other animals) by their names, but you know which name belongs to which species.
In such a case I would suggest introducing a named entity token, replacing each animal's name with its species. In your example, "Max barks" could be replaced with "%DOG% barks" and "Kitty barks" with "%CAT% barks".
This would form a strong signal for the model to pick up and train correctly.
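As a minimal sketch of this replacement step, assuming you keep the names in a plain dictionary (the `ENTITY_TO_GROUP` table and the `replace_entities` helper below are hypothetical names, not part of any library), a regular expression with word boundaries is enough:

```python
import re

# Hypothetical lookup table: entity name -> group. Fill in your own entities.
ENTITY_TO_GROUP = {"Max": "DOG", "Rex": "DOG", "Kitty": "CAT", "Tom": "CAT"}

# One pattern that matches any known name as a whole word.
# Longer names are tried first so that a short name never shadows a longer one.
_NAMES = sorted(ENTITY_TO_GROUP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def replace_entities(text: str) -> str:
    """Replace every known entity name with its group token, e.g. %DOG%."""
    return _PATTERN.sub(lambda m: f"%{ENTITY_TO_GROUP[m.group(1)]}%", text)


print(replace_entities("Max barks"))    # %DOG% barks
print(replace_entities("Kitty barks"))  # %CAT% barks
```

The same function has to be applied to the training data and to every input at inference time. Depending on the tokenizer you use, a token such as `%DOG%` may be split into several pieces, so it is worth checking how it is tokenized before training.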
Otherwise, you could go with your approach of generating all the potential examples for dogs and cats, where each name is only loosely linked to one group or the other through the label of the training / testing example. Even though it is a bit cumbersome, it can be more practical than introducing another step into the processing pipeline, namely Named Entity Recognition, which translates the names of the animals into their species. Such a step would be necessary both during training and during inference.
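If you prefer generating the examples, it does not have to be done by hand. Here is a rough sketch, assuming you can write your sentences as templates with a `{name}` slot and a label per group (the `GROUPS` and `TEMPLATES` structures and the `expand` function are made up for illustration):

```python
# Hypothetical data: group -> names belonging to it.
GROUPS = {"DOG": ["Max", "Rex"], "CAT": ["Kitty", "Tom"]}

# Hypothetical templates: a sentence with a {name} slot, plus the label
# that the sentence should get for each group.
TEMPLATES = [
    ("{name} barks", {"DOG": 1, "CAT": 0}),
]


def expand(templates, groups):
    """Build (sentence, label) pairs for every template and every name."""
    rows = []
    for template, labels in templates:
        for group, names in groups.items():
            for name in names:
                rows.append((template.format(name=name), labels[group]))
    return rows


for sentence, label in expand(TEMPLATES, GROUPS):
    print(sentence, label)
```

The number of generated rows grows with the number of templates times the number of names, which is where this approach becomes cumbersome for large entity lists.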