python nlp semantics similarity wordnet

Methods to extract keywords relevant to a set of predefined guidelines from large documents using NLP / semantic similarity


I'm looking for suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.

For example, given a document about a company, I need the owner's name, where the office is located, and the industry the company operates in. The defined set of words would be:

{owner, director, office, industry...}-(1)

The intended output would be something like:

{Mr.Smith James, ,Main Street, Financial Banking}-(2)

I was considering a method based on semantic similarity: extract the sentences that contain words similar to those in set (1), then use POS tagging to pull the nouns out of those sentences.
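A minimal sketch of that two-step idea, under some loud assumptions: exact whole-word matching stands in for a real semantic similarity measure (WordNet synonyms or word vectors could be swapped in), the `nlp` argument is assumed to be a loaded spaCy pipeline with a tagger and sentence segmentation (for example `en_core_web_sm`), and the function names are made up for illustration.

```python
import re

# Set (1) from the question; extend it as needed.
GUIDELINE_WORDS = {"owner", "director", "office", "industry"}


def matching_guideline_words(sentence, guideline_words=GUIDELINE_WORDS):
    """Return the guideline words that occur in the sentence as whole words."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(words & set(guideline_words))


def nouns_in_relevant_sentences(nlp, text):
    """Yield (matched guideline words, nouns) for each sentence that mentions one.

    `nlp` is assumed to be a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm").
    """
    for sent in nlp(text).sents:
        hits = matching_guideline_words(sent.text)
        if hits:
            # Keep common and proper nouns according to the pipeline's POS tags.
            nouns = [tok.text for tok in sent if tok.pos_ in {"NOUN", "PROPN"}]
            yield hits, nouns


# The filtering step alone needs no external library:
print(matching_guideline_words("The owner of the office is Mr. Smith James."))
# ['office', 'owner']

# Full pipeline, assuming spaCy and the model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   for hits, nouns in nouns_in_relevant_sentences(nlp, document_text):
#       print(hits, nouns)
```

The noun lists will still contain noise, so some post-filtering per guideline word is likely to be needed.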

It would be useful if you could point me to further resources that support this approach.


Solution

  • What you want to do is referred to as Named Entity Recognition.

    In Python, a popular library for this is spaCy. Its standard English models can detect 18 different entity types, which is a fairly good number.

    Person and company names should be easy to extract, while whole addresses and the industry might be more difficult. You may have to train your own model on these entity types; spaCy provides an API for that as well. Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see whether that is sufficient for your needs. POS tags can be used as a feature.

    If your data is unstructured, this is probably one of the most suitable approaches. If your data is more structured, you may be able to take advantage of that structure instead.
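As a rough starting point for the NER route, the sketch below runs a pretrained pipeline over a text and groups what it finds by entity label. It assumes spaCy is installed and the small English model `en_core_web_sm` has been downloaded; the helper names `group_by_label` and `extract_entities` are made up for illustration. Which entities come back for a given text depends on the model, so no output is shown for the spaCy call.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into {label: [text, ...]}, dropping exact repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def extract_entities(nlp, text):
    """Run a loaded spaCy pipeline over `text` and group its entities by label."""
    doc = nlp(text)
    return group_by_label((ent.text, ent.label_) for ent in doc.ents)


# The grouping helper on its own, with hand-written pairs:
print(group_by_label([("James Smith", "PERSON"), ("London", "GPE"), ("James Smith", "PERSON")]))
# {'PERSON': ['James Smith'], 'GPE': ['London']}

# With a real model, assuming `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(extract_entities(nlp, "Acme Bank is owned by James Smith. "
#                               "Its head office is on Main Street in London."))
```

Labels such as `PERSON`, `ORG`, and `GPE` cover people, companies, and places. The pretrained models have no label for an industry, which is where a custom-trained model would come in.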