python nlp text-extraction named-entity-recognition edit-distance

How to extract a custom list of entities from a text file?


I have a list of entities which look something like this:

["Bluechoice HMO/POS", "Pathway X HMO/PPO", "HMO", "Indemnity/Traditional Health Plan/Standard"]

This is not the exhaustive list; there are other similar entries.

I want to extract these entities, if present, from a text file (over 30 pages long). The catch is that the text file was generated by OCR, so it might not contain the entries exactly. For example, it might have:

"Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"

Notice the OCR errors in "BIueChoise HMOIPOS" compared with "Bluechoice HMO/POS".

I want to extract the entities that are present in the text file even when the corresponding words do not match perfectly.

Any help, be it an algorithm or an approach, is welcome. Thanks a lot!


Solution

  • You can do this with approximate string-matching algorithms, which measure how similar two strings are: Levenshtein distance, Hamming distance, cosine similarity, and many more.

    textdistance is a module that provides a wide range of such algorithms. See its documentation for details.

    I had a similar problem, which I solved with textdistance by taking every substring of the text file whose length equals that of the string I needed to find, and then scoring each substring with one of the algorithms to see which worked best. For me, cosine similarity gave the best results when I kept only the substrings that matched above 75%.

    Taking "Bluechoice HMO/POS" from your question as an example, I applied it like this:

    >>> import textdistance
    >>>
    >>> search_strg = "Bluechoice HMO/POS"
    >>> text_file_strg = "Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"
    >>>
    >>> extracted_strgs = []
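    >>> # Slide a window of len(search_strg) characters across the text
    >>> n = len(search_strg)
    >>> for i in range(len(text_file_strg) - n + 1):
    ...     substr = text_file_strg[i:i + n]
    ...     # Keep windows whose cosine similarity exceeds 0.75
    ...     if textdistance.cosine(substr, search_strg) > 0.75:
    ...         extracted_strgs.append(substr)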
    ... 
    >>> extracted_strgs
    ['BIueChoise HMOIPOS']
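
To run the same idea over the whole entity list, here is a minimal sketch. It does not use textdistance: it swaps in a hand-written Levenshtein distance so that it runs with the standard library alone. The names `levenshtein`, `similarity`, and `find_entities`, the case-insensitive comparison, and the 0.75 threshold are all assumptions made for illustration, not part of the approach described above.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; case is ignored (an assumption, since OCR often confuses it)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def find_entities(text: str, entities: List[str],
                  threshold: float = 0.75) -> Dict[str, Tuple[str, float]]:
    """Map each entity to its best-scoring window, if that score reaches the threshold."""
    found: Dict[str, Tuple[str, float]] = {}
    for entity in entities:
        n = len(entity)
        best_window, best_score = "", 0.0
        # Slide a window of the entity's length across the text.
        for i in range(max(len(text) - n + 1, 1)):
            window = text[i:i + n]
            score = similarity(window, entity)
            if score > best_score:
                best_window, best_score = window, score
        if best_score >= threshold:
            found[entity] = (best_window, best_score)
    return found


entities = ["Bluechoice HMO/POS", "Pathway X HMO/PPO", "HMO",
            "Indemnity/Traditional Health Plan/Standard"]
text = ("Out of all the entries the user made, "
        "BIueChoise HMOIPOS is the most prominent")

for entity, (window, score) in find_entities(text, entities).items():
    print(f"{entity!r} matched {window!r} (score {score:.2f})")
```

On the sample sentence this reports "Bluechoice HMO/POS" (matched against "BIueChoise HMOIPOS", score about 0.83) and also "HMO", because the short entity occurs inside the longer one. If that overlap is unwanted, one option is to match the longest entities first and skip text already claimed. Pure-Python edit distance is slow on a 30-page file, so a compiled implementation is likely a better fit for real input.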