Tags: python, regex, text-mining, tf-idf, stop-words

How to remove one-letter tokens with TF-IDF Vectorizer


I'm working on a small project to calculate TF-IDF for a document that contains book titles and their abstracts. So far I have only managed to remove stop words and numbers. My goal now is to keep only words with at least three letters and to lemmatize them. This is the code I have written:

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b')
tfidf_matrix = tf_idf.fit_transform(doc)
print(tfidf_matrix)

If I print "tf_idf.vocabulary_" I get all the words that occur in the document, but also single letters such as r, s, t, m, etc. As for lemmatization, I don't know how to go about it and I don't yet understand how it works. If someone could give me a hand, thank you in advance.


Solution

  • token_pattern: str, default=r"(?u)\b\w\w+\b". Regular expression denoting what constitutes a "token"; only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

    To select words that contain at least three letters change your regex:

    tf_idf = TfidfVectorizer(stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b')
    

    to use the regex quantifier {3,}, which matches its preceding element at least three times:

    tf_idf = TfidfVectorizer(stop_words='english', analyzer='word', token_pattern=r'(?u)\b[A-Za-z]{3,}\b')
    
    # doc used as sample text.
    doc = """Hi Lucia. How are you? It was so nice to meet you last week in Sydney at the sales meeting. How was the rest of your trip? Did you see any kangaroos? I hope you got home to Mexico City OK.
    Anyway, I have the documents about the new Berlin offices. We're going to be open in three months. I moved here from London just last week. They are very nice offices, and the location is perfect.
    There are lots of restaurants, cafés and banks in the area. There's also public transport; we are next to an U-Bahn (that is the name for the metro here). Maybe you can come and see them one day? I would love to show you Berlin, especially in the winter. You said you have never seen snow – you will see lots here! Here's a photo of you and me at the restaurant in Sydney. That was a very fun night! Remember the singing Englishman? Crazy! Please send me any other photos you have of that night. Good memories.
    Please give me your email address and I will send you the documents. Bye for now. Mikel"""
    
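    # Fit before reading vocabulary_; fit_transform expects an iterable of
    # strings, so the single sample text is wrapped in a list.
    tfidf_matrix = tf_idf.fit_transform([doc])
    print(tf_idf.vocabulary_)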
    {
       "lucia": 27,
       "nice": 38,
       "meet": 29,
       "week": 59,
       "sydney": 56,
       "sales": 51,
       "meeting": 30,
       "rest": 47,
       "trip": 58,
       "did": 10,
       "kangaroos": 22,
       "hope": 20,
       "got": 18,
       ...
       ...
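
The question also asks about lemmatization, which the pattern change alone does not cover. Lemmatization maps each inflected form to its dictionary form (for example "offices" to "office"), so that both count as the same term. One possible approach is a tokenizer function that first applies the same three-letter pattern and then passes each token through a lemmatizer. The sketch below uses only the standard library. The names TOY_LEMMAS, toy_lemmatize and lemma_tokenizer are hypothetical, and the lookup table is only a stand-in for a real lemmatizer such as NLTK's WordNetLemmatizer:

```python
import re

# Same pattern as in the answer: runs of at least three ASCII letters.
TOKEN_RE = re.compile(r"(?u)\b[A-Za-z]{3,}\b")

# Hypothetical stand-in for a real lemmatizer: a tiny hand-written lookup table.
TOY_LEMMAS = {
    "offices": "office",
    "kangaroos": "kangaroo",
    "documents": "document",
}


def toy_lemmatize(token):
    """Return the table entry for token, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


def lemma_tokenizer(text, lemmatize=toy_lemmatize):
    """Split text into 3+ letter tokens and map each one through lemmatize."""
    # Lowercase first, mirroring what the vectorizer does by default.
    return [lemmatize(tok) for tok in TOKEN_RE.findall(text.lower())]


sample = "I have the documents about the new Berlin offices. r s t"
tokens = lemma_tokenizer(sample)
print(tokens)
# ['have', 'the', 'document', 'about', 'the', 'new', 'berlin', 'office']
```

A function like this could then be passed as TfidfVectorizer(stop_words='english', tokenizer=lemma_tokenizer). When a tokenizer is supplied, token_pattern is not used, which is why the length filter lives inside the function here. Stop words are not removed in this sketch; as far as I know the vectorizer filters them after the tokenizer has run, so the stop-word list should match the lemmatized forms.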