Given a list of predefined terms that can be formed by one, two or even three words, the problem is to count their occurrences in a set of documents with an open vocabulary (i.e., many more words).
terms = [
    [t1],
    [t2, t3],
    [t4, t5, t6],
    [t7], ...]
and the documents where these terms need to be recognized are of the form:
docs = [
    [w1, w2, t1, w3, w4, t7],       # d1
    [w1, w4, t4, t5, t6, wi, ...],  # d2
    [wj, t7, ...], ...]             # d3
The desired output should be
[2, 1, 1, ...]
That is, the first doc contains two terms of interest, the second contains one (formed of three words), and so on.
If the terms to be counted were all one word long, I could simply deduplicate each document (with a set), intersect it with the set of single-word terms, and the size of the intersection would be the result I'm after.
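For illustration, a minimal sketch of that single-word case (the term and document strings are just placeholders):

single_word_terms = {"t1", "t7"}               # placeholder terms
docs = [["w1", "w2", "t1", "w3", "w4", "t7"],  # placeholder documents
        ["w1", "t7"]]
counts = [len(set(doc) & single_word_terms) for doc in docs]
print(counts)  # [2, 1]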
But with terms of length >=2 things get tricky.
I've been using gensim to build a dictionary of the terms' words and to look up the indexes of the tokens in an unseen document, e.g.:
from gensim import corpora

dict_terms = corpora.Dictionary(phrases)
sentence = unseen_docs[0]
idxs = dict_terms.doc2idx(sentence)
I then count the seen idxs, checking whether the indexes are consecutive: consecutive indexes would mean that a single multi-word term has been seen, and not 2 or 3 separate ones.
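Roughly, the counting step I have in mind looks like the sketch below (count_terms is just an illustrative helper; it assumes each term's words got consecutive ids when the dictionary was built, and that doc2idx returns -1 for unknown tokens):

def count_terms(idxs):
    count, i = 0, 0
    while i < len(idxs):
        if idxs[i] == -1:        # token not in the terms dictionary
            i += 1
            continue
        count += 1               # start of a (possibly multi-word) term
        while i + 1 < len(idxs) and idxs[i + 1] == idxs[i] + 1:
            i += 1               # absorb the rest of a consecutive run
        i += 1
    return count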
Any suggestions?
Scikit-learn (a very popular Python package for machine learning) has a class that does exactly what you're asking for.
Here's how to do it:
First, install scikit-learn:
pip install scikit-learn
Now the code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 3))
# corpus must be an iterable (e.g. a list) of strings, so join each
# tokenized document back into a single string if needed:
corpus = [...]
X = vectorizer.fit_transform(corpus)
print(X.toarray())  # X is sparse; .toarray() shows it as a dense matrix
The output is an m x n matrix (m documents, n features). E.g.:
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
Columns represent words (or n-grams), rows represent documents. So each row is the resulting bag of words for that document.
But how do you retrieve which words appear in which column? You can get each column's name by using:
print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
You'll get a list of words and n-grams (sorted alphabetically).
Now, suppose you want to know the number of times each word appears in your corpus (not in a single document).
The matrix you receive as output is a SciPy sparse matrix; it can easily be collapsed (summing over all rows) with numpy (another package):
import numpy as np  # np is the conventional alias for numpy, if you don't know this already
sum_of_all_words = np.sum(X, axis=0)
That'll give you something like:
[[1 4 2 4 1 1 4 1 4]]
The column order is the same as in the list of feature names above.
Finally, you can filter the counts down to your terms of interest. The feature names CountVectorizer produces are plain strings, with multi-word n-grams joined by spaces, so a set of your terms joined the same way (rather than the gensim Dictionary, which stores individual words) works for the lookup:
# CountVectorizer lowercases by default, so keep the term strings lowercase too
term_strings = {" ".join(term) for term in terms}
counts = {}
words = vectorizer.get_feature_names_out()
for idx, word in enumerate(words):
    if word in term_strings:
        counts[word] = sum_of_all_words[0, idx]
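By the way, if what you ultimately want is the per-document counts from your example output ([2, 1, 1, ...]) rather than corpus-wide totals, one option (a sketch, assuming each document has been joined into a single string and the terms are joined the same way) is to fix the vectorizer's vocabulary to exactly your terms and sum each row:

from sklearn.feature_extraction.text import CountVectorizer

term_strings = [" ".join(term) for term in terms]  # e.g. ["t1", "t2 t3", "t4 t5 t6", "t7"]
vectorizer = CountVectorizer(ngram_range=(1, 3), vocabulary=term_strings)
X = vectorizer.fit_transform(corpus)  # corpus: one string per document
per_doc_counts = X.sum(axis=1)        # total term occurrences per document
print(per_doc_counts)                 # e.g. [[2] [1] [1] ...]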
Hope this helps!
Read more about CountVectorizer here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
(Also, take a look at TfidfVectorizer; if you're using bag of words, tf-idf is a huge upgrade in most cases.)
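It has the same interface, so switching is a one-liner (sketch only):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(corpus)  # tf-idf weights instead of raw counts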
I also recommend you to take a look at this page for feature extraction with sklearn: https://scikit-learn.org/stable/modules/feature_extraction.html