I hope I don't have to provide an example set.
I have a 2D array where each inner array holds the words of a sentence. I am using a CountVectorizer, effectively calling fit_transform
on the whole 2D array, so that I can build a vocabulary of words.
However, I have sentences like:
u'Besides EU nations , Switzerland also made a high contribution at Rs 171 million LOCATION_SLOT~-nn+nations~-prep_besides nations~-prep_besides+made~prep_at made~prep_at+rs~num rs~num+NUMBER_SLOT'
My current vectorizer is too strict: it strips out characters like ~ and + when building tokens. Instead, I want it to treat each whitespace-separated word, in the split() sense, as a token in the vocab; i.e. rs~num+NUMBER_SLOT should be a word in the vocab in its own right, as should made. At the same time, stopwords like the and a (the normal stopwords set) should be removed.
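To make that concrete, plain split() already yields exactly the tokens I want (a minimal sketch on a shortened version of the sentence above):

sentence = u'Besides EU nations , Switzerland also made rs~num+NUMBER_SLOT'
print(sentence.split())
# [u'Besides', u'EU', u'nations', u',', u'Switzerland', u'also', u'made', u'rs~num+NUMBER_SLOT']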
Current vectorizer:
vectorizer = CountVectorizer(analyzer="word", stop_words=None, tokenizer=None, preprocessor=None, max_features=5000)
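For reference, here is what that default word analyzer does to one of my tokens (a quick check using scikit-learn's build_tokenizer(), which applies the active token_pattern):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", stop_words=None, tokenizer=None,
                             preprocessor=None, max_features=5000)
# The default token_pattern, (?u)\b\w\w+\b, only keeps runs of two or more
# word characters, so ~ and + become token boundaries:
tokenize = vectorizer.build_tokenizer()
print(tokenize(u'rs~num+NUMBER_SLOT'))
# [u'rs', u'num', u'NUMBER_SLOT']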
You can specify a token_pattern, but I am not sure which one achieves what I want. Trying:
token_pattern="[^\s]*"
Leads to a vocabulary of:
{u'': 0, u'p~prep_to': 3764, u'de~dobj': 1107, u'wednesday': 4880, ...}
This messes things up, since the empty string u'' is not something I want in my vocabulary.
What is the right token_pattern for the kind of vocabulary_ I want to build?
I have figured this out. The pattern was matching zero or more non-whitespace characters, when it should match one or more. The correct CountVectorizer
is:
CountVectorizer(analyzer="word", token_pattern=r"[\S]+", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
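As a sanity check (a sketch; the tiny two-document corpus here is made up for illustration):

import re
from sklearn.feature_extraction.text import CountVectorizer

# The root cause: with *, re.findall also returns the zero-width matches
# it finds at whitespace and at the end of the string; + does not.
print(re.findall(r"[\S]*", u'made a'))  # [u'made', u'', u'a', u'']
print(re.findall(r"[\S]+", u'made a'))  # [u'made', u'a']

corpus = [u'Besides EU nations rs~num+NUMBER_SLOT',
          u'Switzerland made made~prep_at+rs~num']
vectorizer = CountVectorizer(analyzer="word", token_pattern=r"[\S]+",
                             tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=5000)
vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)
# No u'' entry any more, and tokens like u'rs~num+number_slot' survive
# intact (lowercased by the default preprocessor).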