I'm fairly new to Python and would like to convert an array of sentences to bigrams. Is there a way to do this? For example:
X = ['I like u', 'u like me', ...]
With ngram = 2, I'd expect the vocabulary to look something like
[0: 'I ',
1: ' l',
2: 'li',
3: 'ik',
4: 'ke',
5: 'e ',
6: ' u',
7: 'u ',
8: ' m',
9: 'me'...]
so X can be converted to
X_conv = [ '0, 1, 2, 3, 4, 5, 6',
'7, 1, 2, 3, 4, 5, 8, 9',....]
Is there a function I can use for this, perhaps with CountVectorizer?
Say you have the function ngrams:
def ngrams(text, n=2):
    # slide a window of width n over the text, one character at a time
    return [text[i:i+n] for i in range(len(text)-n+1)]
Applying this to every element of a list is then straightforward:
>>> sentences = ['I like u', 'u like me']
>>> processed = [ngrams(sentence, n=2) for sentence in sentences]
>>> processed
[['I ', ' l', 'li', 'ik', 'ke', 'e ', ' u'],
['u ', ' l', 'li', 'ik', 'ke', 'e ', ' m', 'me']]
So that is rather easy. To number the ngrams, you could build nested for loops, but it wouldn't look nice.
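For comparison, the explicit-loop version might look roughly like the sketch below. It keeps a plain dict and hands out the next unused number whenever an unseen n-gram shows up. The names number_ngrams and ids are made up for illustration and are not part of any library.

```python
def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text) - n + 1)]


def number_ngrams(sentences, n=2):
    # hypothetical helper: ids maps n-gram -> number, in order of first appearance
    ids = {}
    numbered = []
    for sentence in sentences:
        row = []
        for gram in ngrams(sentence, n):
            if gram not in ids:
                ids[gram] = len(ids)  # next unused number
            row.append(ids[gram])
        numbered.append(row)
    return numbered, ids


numbered, ids = number_ngrams(['I like u', 'u like me'])
print(numbered)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
```

It works, but the bookkeeping is spread over two nested loops.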
Instead we can use a trick: collections.defaultdict, which creates a new item the first time a missing key is accessed. We couple this with itertools.count(), which returns an iterator that counts up from 0. Its __next__ method is a callable that returns 0 the first time it is called, 1 the second time, and so forth. defaultdict calls this method once for each new key, so every distinct ngram gets the next free number:
from collections import defaultdict
from itertools import count
reverse_vocabulary = defaultdict(count().__next__)
numbered = [[reverse_vocabulary[ngram] for ngram in sentence]
            for sentence in processed]
print(numbered)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
Now the reverse vocabulary is the opposite of what you'd want, since it maps each ngram to its number:
defaultdict(<...>, {' m': 8, ' u': 6, 'I ': 0, 'li': 2, 'u ': 7, 'e ': 5, 'ke': 4, 'ik': 3,
' l': 1, 'me': 9})
We make an ordinary dictionary of it by inverting the mapping:
vocabulary = {number: ngram for ngram, number in reverse_vocabulary.items()}
which results in vocabulary mapping each number back to its ngram:
{0: 'I ', 1: ' l', 2: 'li', 3: 'ik', 4: 'ke', 5: 'e ', 6: ' u', 7: 'u ', 8: ' m', 9: 'me'}
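Putting the pieces together, and assuming you really want each sentence as a comma-separated string like in the question, a minimal sketch could look like this. The function name encode is hypothetical; everything else is standard library.

```python
from collections import defaultdict
from itertools import count


def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text) - n + 1)]


def encode(sentences, n=2):
    # hypothetical wrapper around the steps shown above
    reverse_vocabulary = defaultdict(count().__next__)
    numbered = [[reverse_vocabulary[gram] for gram in ngrams(sentence, n)]
                for sentence in sentences]
    vocabulary = {number: gram for gram, number in reverse_vocabulary.items()}
    return numbered, vocabulary


numbered, vocabulary = encode(['I like u', 'u like me'])

# join each list of numbers into the string form asked for in the question
X_conv = [', '.join(str(i) for i in row) for row in numbered]
print(X_conv)
# ['0, 1, 2, 3, 4, 5, 6', '7, 1, 2, 3, 4, 5, 8, 9']

# the vocabulary lets you map the numbers back to ngrams
decoded = [[vocabulary[i] for i in row] for row in numbered]
```

In most cases the lists of integers in numbered are easier to work with than the joined strings, so the join step is only worth keeping if something downstream needs text.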