python, dataframe, keyword, n-gram

Remove keywords which are not bigram or trigram (Yake)


I am using Yake (Yet Another Keyword Extractor) to extract keywords from a dataframe. I want to extract only bigrams and trigrams, but Yake only lets you set a maximum n-gram size, not a minimum. How would you remove the unigrams?

Example, first row of the dataframe:

Text: 'oui , yes , i mumbled , the linguistic transition now in limbo .'

Keywords: '[('oui', 0.04491197687864554), ('linguistic transition', 0.09700399286574239), ('mumbled', 0.15831692877998726)]'

I want to remove 'oui' and 'mumbled', along with their scores, from the Keywords column.

Thank you for your time!


Solution

  • If the problem is that the keywords list contains some unigrams, you can filter out every entry whose phrase contains no space and collect the rest in a new list. For example:

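    # Example (phrase, score) pairs, taken from the question.
    keywords = [('oui', 0.04491197687864554),
                ('linguistic transition', 0.09700399286574239),
                ('mumbled', 0.15831692877998726)]

    # A unigram contains no space, so keep only phrases that do.
    keywords_without_unigrams = []
    for kw in keywords:
        if ' ' in kw[0]:
            keywords_without_unigrams.append(kw)

    for kw in keywords_without_unigrams:
        print(kw)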
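
The same idea can be packaged as a function and applied to a whole column. The sketch below is only one possible way to do it: the name `keep_ngrams` and its `min_n`/`max_n` parameters are made up for illustration and are not part of Yake. It also assumes the Keywords cell may be stored as a string, as the quotes in the question suggest, and parses it with `ast.literal_eval` in that case.

```python
import ast

def keep_ngrams(keywords, min_n=2, max_n=3):
    """Keep only (phrase, score) pairs whose phrase has min_n..max_n words.

    Hypothetical helper, not part of Yake.
    """
    # Assumption: the cell holds either the list itself or its string form.
    if isinstance(keywords, str):
        keywords = ast.literal_eval(keywords)
    # Count words by splitting on whitespace.
    return [(phrase, score) for phrase, score in keywords
            if min_n <= len(phrase.split()) <= max_n]

# The Keywords cell from the question, stored as a string.
cell = ("[('oui', 0.04491197687864554), "
        "('linguistic transition', 0.09700399286574239), "
        "('mumbled', 0.15831692877998726)]")

filtered = keep_ngrams(cell)
print(filtered)  # [('linguistic transition', 0.09700399286574239)]
```

Assuming the column is named Keywords as in the example, something like `df['Keywords'] = df['Keywords'].apply(keep_ngrams)` should filter every row. Counting words rather than testing for a space also lets you drop anything longer than a trigram if the maximum n-gram size was set higher.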