Tags: python, nlp, gensim, phrase

Gensim Phrases: handling sentences with a lot of punctuation


I am trying to use gensim Phrases to learn phrases/special meanings based on my own corpus.

Suppose I have a corpus related to car brands. After removing punctuation and stopwords, I tokenize the sentences, e.g.:

sent1 = 'aston martin is a car brand'
sent2 = 'audi is a car brand'
sent3 = 'bmw is a car brand'
...

I would like to use gensim Phrases to learn from this so that the output looks like:

from gensim.models import Phrases
sents = [sent1, sent2, sent3, ...]
sents_stream = [sent.split() for sent in sents]
bigram = Phrases(sents_stream)

for sent in sents_stream:
    print(bigram[sent])

# Output should look like:
['aston_martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
...

However, if there are sentences that contain a lot of punctuation:

sent1 = 'aston martin is a car brand'
sent2 = 'audi is a car brand'
sent3 = 'bmw is a car brand'
sent4 = 'jaguar, aston martin, mini cooper are british car brand'
sent5 = 'In all brand, I love jaguar, aston martin and mini cooper'
...

Then the output looks like:

from gensim.models import Phrases
sents = [sent1, sent2, sent3, sent4, sent5, ...]
sents_stream = [sent.split() for sent in sents]
bigram = Phrases(sents_stream)

for sent in sents_stream:
    print(bigram[sent])

# Output looks like:
['aston', 'martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
['jaguar', 'aston', 'martin_mini', 'cooper', 'british', 'car', 'brand']
['all', 'brand', 'love', 'jaguar', 'aston', 'martin_mini', 'cooper']
...

In this case, how should I handle sentences with a lot of punctuation to prevent the martin_mini case and make the output look like:

['aston', 'martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
['jaguar', 'aston_martin', 'mini_cooper', 'british', 'car', 'brand'] # Change
['all', 'brand', 'love', 'jaguar', 'aston_martin', 'mini_cooper'] # Change
...

Thanks so much for helping!


Solution

  • The punctuation may not be the major contributor to your unsatisfactory results.

    The Phrases class needs lots of natural usage examples to apply its purely statistics-based combination of plausible bigrams. (It won't work well on small/toy-sized/contrived datasets.)

    And even with lots of data, that Phrases class won't consistently match the "phrases" or "entities" that humans naturally perceive, using their understanding of parts-of-speech and the underlying concepts in the world. Even with lots of tuning of its various meta-parameters, it will miss pairings you might prefer, and make pairings you may consider unnatural. Text with its pairings added may still be useful for many purposes – especially classification & info-retrieval tasks – but is unlikely to appear aesthetically correct to human reviewers.

    In your tiny contrived example, it appears that martin_mini becomes a bigram because the words martin and mini appear alongside each other enough, compared to their individual frequencies, to trigger the Phrases algorithmic-combination.

    To prevent that particular outcome, you could consider (1) giving Phrases more/better data; (2) tuning Phrases parameters like min_count, threshold, or scorer; or (3) changing your preprocessing/tokenization.
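    For example, here is a minimal sketch of option (2), reusing sents_stream from the question and assuming a gensim 4.x-style API; the min_count and threshold values are just the defaults shown as placeholders you would tune on your real corpus, not recommendations:

    from gensim.models import Phrases

    # Stricter settings: a pair must co-occur at least `min_count` times and
    # score above `threshold` before Phrases joins it into a bigram.
    # With the default scorer, the score of a pair (a, b) is roughly
    #   (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    bigram = Phrases(sents_stream, min_count=5, threshold=10.0)

    # Inspect which bigrams were actually learned, with their scores
    # (in gensim 4.x, export_phrases() returns a {phrase: score} dict).
    print(bigram.export_phrases())

    Raising threshold (or min_count) makes the model more conservative, so accidental pairings like martin_mini are less likely to clear the bar on a small corpus.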

    I'm not sure what would work best for your full dataset & project goals, and as noted above, the results of this technique may never closely match your ideas of multi-word car terms.

    You might also consider leaving in punctuation as tokens, and leaving in stop words, so that your preprocessing doesn't create false pairings like "martin mini". For example, your sent5 tokenization could become:

    ['in', 'all', 'brand', ',', 'i', 'love', 'jaguar', ',', 'aston', 'martin', 'and', 'mini', 'cooper']
    

    The data's natural splitting of martin and mini would then be restored in the version that reaches Phrases – so you'd be unlikely to see the same failure you're seeing. (You might very well see other failures instead, where undesired punctuation or stop-words become part of identified bigrams, when statistics imply those tokens co-occur often enough to be considered a single unit. But that's the essence and limitation of the Phrases algorithm.)
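    As a rough sketch of that alternative preprocessing, assuming a simple regex-based tokenizer (the regex and the lowercasing are illustrative choices, not the only reasonable ones):

    import re

    def tokenize_keep_punct(sentence):
        # Keep words and individual punctuation marks as separate tokens.
        return re.findall(r"\w+|[^\w\s]", sentence.lower())

    sent5 = 'In all brand, I love jaguar, aston martin and mini cooper'
    print(tokenize_keep_punct(sent5))
    # ['in', 'all', 'brand', ',', 'i', 'love', 'jaguar', ',', 'aston', 'martin', 'and', 'mini', 'cooper']

    Feeding streams tokenized this way into Phrases keeps ',' and 'and' sitting between martin and mini, so that particular false pairing is unlikely to form; you can strip the punctuation tokens again after the bigrams have been applied.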