python tf-idf cosine-similarity

Add stop_words while performing TF-IDF cosine similarity


I'm using sklearn to compute cosine similarity.

Is there a way to consider all the words starting with a capital letter as stop words?


Solution

  • The following regex takes a string and replaces every word (a run of alphanumeric or underscore characters) that begins with an uppercase ASCII letter with the empty string. The surrounding whitespace is left in place, so the result contains double spaces. See http://docs.python.org/2.7/library/re.html for more options.

    import re

    s1 = "The cat Went to The store To get Some food doNotMatch"
    # \w* (rather than \w+) also matches one-letter words such as "A" or "I"
    r1 = re.compile(r'\b[A-Z]\w*')
    r1.sub('', s1)
    # ' cat  to  store  get  food doNotMatch'
    

    sklearn also has many facilities for text feature generation, such as those in sklearn.feature_extraction.text. You might also want to consider NLTK to assist with sentence segmentation, stop-word removal, etc.
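
One way to wire this into sklearn might look like the sketch below. It assumes a reasonably recent scikit-learn and passes the regex substitution to TfidfVectorizer through its `preprocessor` parameter. The helper names `drop_capitalized` and `capital_free_similarity` are made up for illustration. The pattern uses `\w*`, so one-letter capitalized words such as "A" are removed as well. A custom preprocessor replaces the vectorizer's built-in lowercasing, so the helper lowercases the text itself, after the capitalized words have been removed.

```python
import re

# Words that start with an uppercase ASCII letter.
CAPITALIZED = re.compile(r"\b[A-Z]\w*")


def drop_capitalized(text):
    """Remove capitalized words, then lowercase what is left."""
    # Removal must come first: after lowercasing, the capitals are gone.
    return CAPITALIZED.sub("", text).lower()


def capital_free_similarity(docs):
    """Pairwise cosine similarity of TF-IDF vectors, ignoring capitalized words."""
    # Imported here so drop_capitalized() stays usable without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # The custom preprocessor runs on each raw document before tokenization.
    vectorizer = TfidfVectorizer(preprocessor=drop_capitalized)
    tfidf = vectorizer.fit_transform(docs)
    return cosine_similarity(tfidf)
```

Calling `capital_free_similarity` on a list of n documents returns an n-by-n array of similarities. If every word in the corpus is capitalized, nothing is left to index and the vectorizer raises a ValueError about an empty vocabulary.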