I am currently reading text from an Excel file and generating bigrams from it. In the sample code below, finalList holds the list of input words read from the Excel file.
I removed the stop words from the input with the help of the following library:
from nltk.corpus import stopwords
The bigram logic is applied to the list of input words:
bigram = ngrams(finalList, 2)
input text: I completed my end-to-end process.
Current output: Completed end, end end, end process.
Desired output: completed end-to-end, end-to-end process.
That is, a hyphenated group of words such as end-to-end should be treated as a single word.
To solve your problem, strip the unwanted punctuation with a regex before tokenizing, and leave the hyphen alone so that hyphenated words stay intact. See this example:
import re
text = 'I completed my end-to-end process..:?'
pattern = re.compile(r"[.:?]+")  # match runs of '.', ':' or '?'; the hyphen is not in the character class, so "end-to-end" is left untouched
new_text = re.sub(pattern, '', text)
print(new_text)
I completed my end-to-end process
# Now you can generate bigrams manually.
# 1. Tokenize the new text
tok = new_text.split()
print(tok)  # If the token list is large, print only the first five: print(tok[:5])
['I', 'completed', 'my', 'end-to-end', 'process']
# 2. Loop over the list and generate bigrams, storing them in a list called bigrams
bigrams = []
for i in range(len(tok) - 1):  # stop one short of the end to avoid an IndexError
    bigram = tok[i] + ' ' + tok[i + 1]
    bigrams.append(bigram)
# 3. Print your bigrams
for bi in bigrams:
    print(bi, end=', ')
I completed, completed my, my end-to-end, end-to-end process,
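To get the desired output from the question, the stop words can be dropped after tokenizing, so that "end-to-end" is already a single token by the time filtering happens. The following is only a minimal sketch: the regex tokenizer, the function name hyphen_aware_bigrams, and the small STOP_WORDS set are assumptions made for illustration (in practice the set could come from NLTK's stopwords corpus), not part of the original code.

```python
import re

# Hypothetical stop-word set for illustration only; a real list
# (for example from nltk.corpus.stopwords) could be substituted.
STOP_WORDS = {"i", "my", "the", "a", "an"}

# One token = a run of letters/digits, optionally followed by more
# hyphen-joined runs, so "end-to-end" is matched as a single token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def hyphen_aware_bigrams(text, stop_words=STOP_WORDS):
    """Return bigrams as strings, keeping hyphenated words whole."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    # Filter whole tokens only, so the "to" inside "end-to-end" is never
    # compared against the stop-word list on its own.
    kept = [t for t in tokens if t not in stop_words]
    # Pair each kept token with its successor.
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


print(", ".join(hyphen_aware_bigrams("I completed my end-to-end process.")))
# completed end-to-end, end-to-end process
```

Because filtering is applied to whole tokens, a stop word that appears inside a hyphenated word is not removed, which is what keeps end-to-end from being split.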
I hope this helps!