python, nlp, gensim

How to prevent certain words from being included when building bigrams using Gensim's Phrases?


I am using Gensim's Phraser model to find bigrams in some reviews, to be used later in an LDA topic modelling scenario. My issue is that the reviews mention the word "service" quite often, so Phraser finds lots of different bigrams with "service" as one half of the pair (e.g. "helpful_service", "good_service", "service_price").

These are then present across multiple topics in the final result*. I'm thinking that I could prevent this from occurring if I was able to tell Phraser not to include "service" when making bigrams. Is this possible?

(*) I am aware that "service"-related bigrams being present across multiple topics might indeed be the optimal result, but I just want to experiment with leaving them out.

Sample code:

# import gensim models
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# sample data
data = [
    "Very quick service left a big tip",
    "Very bad service left a complaint to the manager"
]
data_words = [doc.split(" ") for doc in data]

# build the bigram model
bigram_phrases = Phrases(data_words, min_count=2, threshold=0, scoring='npmi') 
# note I used the arguments above to force "service" based bigrams to be created for this example
bigram_phraser = Phraser(bigram_phrases)

# print the result
for doc_words in data_words:
    tokens_ = bigram_phraser[doc_words]
    print(tokens_)

The above prints:

['Very', 'quick', 'service_left', 'a', 'big', 'tip']
['Very', 'bad', 'service_left', 'a', 'complaint', 'to', 'the', 'manager']

Solution

  • Caution: The following behavior seems to change with version 4.0.0!

    If you are indeed only working with bigrams, you can use the common_terms={} parameter of Phrases, which is (according to the docs):

    [a] list of “stop words” that won’t affect frequency count of expressions containing them. Allow to detect expressions like “bank_of_america” or “eye_of_the_beholder”.

    If I add a simple common_terms={"service"} to your sample code, I am left with the following result:

    ['Very', 'quick', 'service', 'left_a', 'big', 'tip']
    ['Very', 'bad', 'service', 'left_a', 'complaint', 'to', 'the', 'manager']
    

    Starting with version 4.0.0, gensim dropped this parameter and replaced it with connector_words. The results should largely be the same, though!