artificial-intelligence similarity

Algorithm to find related words in a text


I would like to take a word (e.g. "Apple") and process a text (or maybe several texts) to come up with related terms. For example: process a document for "Apple" and find that iPod, iPhone, and Mac are terms related to "Apple".

Any idea on how to solve this?


Solution

  • As a starting point: your question relates to text mining.

    There are two approaches: a statistical one, and one from natural language processing (NLP).

    I do not know much about NLP, but I can say something about the statistical approach:

    1. You need some vector space representation of your documents, see:

      http://en.wikipedia.org/wiki/Vector_space_model
      http://en.wikipedia.org/wiki/Document-term_matrix
      http://en.wikipedia.org/wiki/Tf%E2%80%93idf

    2. In order to learn semantics (that is, that different words can mean the same thing, or that one word can have several meanings), you need a large text corpus to learn from. As I said, this is a statistical approach, so you need lots of samples. See, for example, http://www.daviddlewis.com/resources/testcollections/

      Maybe you already have lots of documents from the context you are going to work in. That is the best situation.

    3. You have to retrieve latent factors from this corpus. Most common are:

      These methods involve lots of math. Either you dig it, or you have to find good libraries.
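To make step 1 a bit more concrete, here is a minimal sketch in plain Python. It builds a tf-idf weighted term-document matrix and ranks terms by the cosine similarity of their rows. It is only an illustration under my own assumptions: the toy corpus, the tokenizer, and the function names (`tokenize`, `tfidf_term_vectors`, `related_terms`) are all made up for this example, and it deliberately skips the latent-factor step 3.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a naive assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_term_vectors(docs):
    """Return {term: [tf-idf weight in doc 0, doc 1, ...]}."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for c in counts for term in c)
    vectors = {}
    for term, d in df.items():
        idf = math.log(n_docs / d)  # 0 for terms that occur in every document
        vectors[term] = [c[term] * idf for c in counts]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(word, docs, top_n=3):
    """Rank all other terms by how similarly they are distributed over the documents."""
    word = word.lower()
    vectors = tfidf_term_vectors(docs)
    target = vectors.get(word)
    if target is None:
        return []
    scores = [(t, cosine(target, v)) for t, v in vectors.items() if t != word]
    # Highest similarity first; ties broken alphabetically for stable output.
    scores.sort(key=lambda s: (-s[1], s[0]))
    return scores[:top_n]


# Toy corpus, invented purely for illustration.
docs = [
    "Apple released a new iPhone and a new Mac",
    "The iPhone and the iPod are made by Apple",
    "Bananas and oranges are fruit",
    "The Mac is a computer made by Apple",
]

for term, score in related_terms("Apple", docs, top_n=5):
    print(f"{term}: {score:.3f}")
```

On a corpus this small, function words such as "by" or "made" will score about as high as "iphone" and "mac", because they happen to occur in the same documents. That is exactly why step 2 asks for lots of samples; filtering stop words would also help.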

    I can recommend the following books: