Doc1: ['And that was the fallacy. Once I was free to talk with staff members']
Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']
Doc3: ['Another reality makes emotional intelligence ever more crucial']
Doc4: ['The globalization of the workforce puts a particular premium on emotional']
Doc5: ['As business changes, so do the traits needed to excel. Data tracking']
and this is a sample of my vocabulary:
my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']
The point is that every entry in my vocabulary is a bigram or trigram. My vocabulary includes all possible bigrams and trigrams in my document set; I only gave a sample here. The application requires the vocabulary to look like this. I am trying to use CountVectorizer as follows:
from sklearn.feature_extraction.text import CountVectorizer
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer( vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set)
I am expecting to get something like this :
print tf:
(0, 126) 1
(0, 6804) 1
(0, 5619) 1
(0, 5019) 2
(0, 5012) 1
(0, 999) 1
(0, 996) 1
(0, 4756) 4
where the first column is the document ID, the second column is the ID of the vocabulary entry, and the third column is the number of times that entry occurs in that document. But tf is empty. I know that, at the end of the day, I can write code that goes through every entry in the vocabulary, counts its occurrences, and builds the matrix, but can I use CountVectorizer with the input I have and save time? Am I doing something wrong here? If CountVectorizer is not the right tool for this, any recommendation would be appreciated.
You can build a vocabulary of all possible bigrams and trigrams by setting the ngram_range parameter of CountVectorizer. After fit_transform you can inspect the vocabulary and the counts with the get_feature_names() and toarray() methods (in recent scikit-learn releases get_feature_names() has been replaced by get_feature_names_out()). toarray() returns the count matrix, with one row per document and one column per n-gram. Further information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
from sklearn.feature_extraction.text import CountVectorizer
Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
vectorizer.vocabulary_
vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn versions
tf.toarray()
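If you want the counts as (document ID, term ID, count) triples like the ones in the question, one option is to convert the sparse matrix to COO format and iterate over it. This is only a minimal sketch on two made-up toy documents (docs and triples are illustrative names, not part of the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for illustration only.
docs = ['was the fallacy was the fallacy', 'free to talk']

vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(docs)

# COO format exposes parallel arrays of row index, column index and value.
coo = tf.tocoo()
triples = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

for doc_id, term_id, count in triples:
    print(doc_id, term_id, count)
```

The term ID is the column index, so vectorizer.vocabulary_ maps each n-gram to the ID that appears in the second position.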
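As for what you tried: with the default ngram_range=(1, 1) the vectorizer only extracts single words from the documents, so nothing matches the multi-word entries in your vocabulary and the matrix comes out empty. It works if you fit a CountVectorizer with ngram_range=(2, 3) on the vocabulary itself and then transform the documents. Note that fitting on the vocabulary also adds the shorter n-grams contained in each entry, which is why 'was the' and 'the fallacy' appear in the output alongside 'was the fallacy'.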
my_vocabulary= ['was the fallacy', 'more crucial', 'particular premium', 'to excel', 'data tracking', 'another reality']
vectorizer = CountVectorizer(ngram_range=(2, 3))
vectorizer.fit_transform(my_vocabulary)
tf = vectorizer.transform(doc_set)
vectorizer.vocabulary_
Out[26]:
{'another reality': 0,
'data tracking': 1,
'more crucial': 2,
'particular premium': 3,
'the fallacy': 4,
'to excel': 5,
'was the': 6,
'was the fallacy': 7}
tf.toarray()
Out[25]:
array([[0, 0, 0, 0, 1, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 0]], dtype=int64)
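If you need the columns to be exactly your vocabulary entries, with no extra sub-n-grams, another option should be to keep the vocabulary argument from the question and add ngram_range so that the analyzer actually produces bigrams and trigrams to match against. A minimal sketch, assuming the default tokenizer and lowercase vocabulary entries (counts is just an illustrative name):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc_set = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'In the new, stripped-down, every-job-counts business climate, these human',
    'Another reality makes emotional intelligence ever more crucial',
    'The globalization of the workforce puts a particular premium on emotional',
    'As business changes, so do the traits needed to excel. Data tracking',
]
my_vocabulary = ['was the fallacy', 'more crucial', 'particular premium',
                 'to excel', 'data tracking', 'another reality']

# A fixed vocabulary keeps the column order of the list; ngram_range makes
# the analyzer emit the bigrams and trigrams that are looked up in it.
vectorizer = CountVectorizer(vocabulary=my_vocabulary, ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
counts = tf.toarray()
print(counts)
```

Keep in mind that the default tokenizer lowercases the text and treats punctuation as a separator, so an entry such as 'stripped-down' can never match; it would have to be written as 'stripped down'.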