Is there any way to tokenize strings with an ngram range, like when you get the features from a CountVectorizer? For example (with ngram_range = (1, 2)):
strings = ['this is the first sentence','this is the second sentence']
[['this','this is','is','is the','the','the first','first','first sentence','sentence'],['this','this is','is','is the','the','the second','second','second sentence','sentence']]
Update: iterating over n, I get:
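from nltk.util import ngrams

sentence = 'this is the first sentence'
nrange_array = []
for n in range(1, 3):
    nrange = ngrams(sentence.split(), n)
    nrange_array.append(nrange)  # collect the n-grams for each n
for nrange in nrange_array:
    for grams in nrange:
        print(grams)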
('this', 'is')
('is', 'the')
('the', 'first')
('first', 'sentence')
and I want:
('this','this is','is','is the','the','the first','first','first sentence','sentence')
I hope this code helps.
x = "this is the first sentence"
words = x.split()
result = []
for index, word in enumerate(words):
    result.append(word)  # the single word (unigram)
    if index != len(words) - 1:
        # the bigram that starts at this word
        result.append(" ".join([word, words[index + 1]]))
print(result) # Output: ["this", "this is", ...]
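If you need more than (1, 2), the same idea can be generalized. The following is only a sketch, not how CountVectorizer does it internally: the function name `ngram_tokens` and its `ngram_range` parameter are names I made up for illustration, and it assumes plain whitespace splitting. For each starting word it emits every n-gram length in the range, so the output stays interleaved the way the question asks for.

```python
def ngram_tokens(sentence, ngram_range=(1, 2)):
    """Return space-joined n-grams, grouped by the word they start at.

    Hypothetical helper: tokens come from str.split(), so punctuation
    and casing are left untouched.
    """
    min_n, max_n = ngram_range
    words = sentence.split()
    result = []
    for start in range(len(words)):
        for n in range(min_n, max_n + 1):
            # Skip n-grams that would run past the end of the sentence.
            if start + n <= len(words):
                result.append(" ".join(words[start:start + n]))
    return result


strings = ['this is the first sentence', 'this is the second sentence']
tokenized = [ngram_tokens(s, ngram_range=(1, 2)) for s in strings]
print(tokenized[0])
# ['this', 'this is', 'is', 'is the', 'the', 'the first', 'first', 'first sentence', 'sentence']
```

Mapping the function over `strings` gives one list per sentence, which matches the nested list shown in the question.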