I am using an Azure Cognitive Search skillset that includes the EntityRecognition skill to find all people, locations, and organizations from within blobs in Azure Storage.
When I run the skill, with varying minimumPrecision values, it always returns a list with duplicate values.
Is there a way to tell the skill to remove duplicates, or do I need to create a custom skill that processes the results of the EntityRecognition skill to remove them?
@Ishan's answer regarding the Distinct PowerSkill is the approach I took, but some details need to be added to make the answer complete.
Because the indexed documents are very large, the main goal was to chunk all document content into 50K-character pages. As a result, each page could contain duplicate keyphrases, with further duplication across pages.
The challenge was how to take all keyphrase arrays for each page and pass them as a collection of words to the PowerSkill Distinct custom skill.
Below is the definition of the custom skill within the skillset used in my solution. The custom skill was deployed from the PowerSkills GitHub repo to a Function App called Distinct20200629152300.
To retrieve the function URL you can grab it from the Code + Test section of the function and paste the URL into the skill definition, shown below.
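For readers who want something concrete, here is a minimal sketch of what such a custom skill definition might look like, built as a Python dict and printed as JSON so it can be pasted into a skillset. Only the input source path comes from this answer. The placeholder URL, the context, the input name words, and the output names distinct and distinct_keyphrases are assumptions to check against your own deployed function.

```python
import json

# Hypothetical placeholder: replace with the URL copied from the
# function's Code + Test section.
FUNCTION_URL = "<function-url-from-code-and-test>"

distinct_skill = {
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "description": "Deduplicate keyphrases across all pages",
    "uri": FUNCTION_URL,
    # Assumption: the skill runs once per document.
    "context": "/document",
    "inputs": [
        {
            # Assumed input name; verify it against the deployed function.
            "name": "words",
            # From the answer: flattens every page's keyphrases into one array.
            "source": "/document/merged_content/pages/*/keyphrases/*",
        }
    ],
    "outputs": [
        # Assumed output name and a hypothetical target name.
        {"name": "distinct", "targetName": "distinct_keyphrases"}
    ],
}

# Print the JSON fragment to paste into the skillset's "skills" array.
print(json.dumps(distinct_skill, indent=2))
```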
The key part of the skill definition is the source path of the words input, /document/merged_content/pages/*/keyphrases/*,
which 'flattens' all page keyphrase arrays into one array. This gives the custom skill access to every page's keyphrases, so it can deduplicate the entire list.
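To illustrate what that flattening buys you, here is a small Python sketch that mimics the effect locally: it collects the keyphrases of every page into one list and then removes duplicates while keeping first-seen order. The helper names and the sample data are hypothetical, and the case-insensitive comparison is an assumption for illustration, not necessarily how the Distinct PowerSkill compares words.

```python
from typing import Dict, List


def flatten_keyphrases(pages: List[Dict[str, List[str]]]) -> List[str]:
    """Collect every page's keyphrases into one list (like pages/*/keyphrases/*)."""
    return [phrase for page in pages for phrase in page.get("keyphrases", [])]


def distinct(words: List[str]) -> List[str]:
    """Drop duplicates, keeping the first occurrence of each word.

    Assumption: comparison ignores case and surrounding whitespace.
    """
    seen = set()
    result = []
    for word in words:
        cleaned = word.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


# Hypothetical sample: two pages with duplicates within and across pages.
pages = [
    {"keyphrases": ["Azure Storage", "skillset", "skillset"]},
    {"keyphrases": ["azure storage", "custom skill"]},
]

all_words = flatten_keyphrases(pages)
print(distinct(all_words))  # ['Azure Storage', 'skillset', 'custom skill']
```

Deduplicating each page separately would still leave "Azure Storage" in the combined result twice, which is why the skill needs the flattened list as a single input.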