Tags: python-3.x, vectorization, sparse-matrix, sentiment-analysis, tf-idf

Document Vectorization Representation in Python


I was trying my hand at sentiment analysis in Python 3, using the TF-IDF vectorizer with the bag-of-words model to vectorize a set of documents.

So, to anyone who is familiar with that, it is quite evident that the resulting matrix representation is sparse.

Here is a snippet of my code. Firstly, the documents.

tweets = [
    ('Once you get inside you will be impressed with the place.', 1),
    ('I got home to see the driest damn wings ever!', 0),
    ('An extensive menu provides lots of options for breakfast.', 1),
    ('The flair bartenders are absolutely amazing!', 1),
    ('My first visit to Hiro was a delight!', 1),
    ('Poor service, the waiter made me feel like I was stupid every time he came to the table.', 0),
    ('Loved this place.', 1),
    ('This restaurant has great food', 1),
    ('Honeslty it did not taste THAT fresh :(', 0), ('Would not go back.', 0),
    ('I was shocked because no signs indicate cash only.', 0),
    ('Waitress was a little slow in service.', 0),
    ('did not like at all', 0), ('The food, amazing.', 1),
    ('The burger is good beef, cooked just right.', 1),
    ('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.', 0),
    ('The cocktails are all handmade and delicious.', 1), ('This restaurant has terrible food', 0),
    ('Both of the egg rolls were fantastic.', 1), ('The WORST EXPERIENCE EVER.', 0),
    ('My friend loved the salmon tartar.', 1), ('Which are small and not worth the price.', 0),
    ('This is the place where I first had pho and it was amazing!!', 1),
    ('Horrible - do not waste your time and money.', 0), ('Seriously flavorful delights, folks.', 1),
    ('I loved the bacon wrapped dates.', 1), ('I dressed up to be treated so rudely!', 0),
    ('We literally sat there for 20 minutes with no one asking to take our order.', 0),
    ('you can watch them preparing the delicious food! :)', 1),
    ('In the summer, you can dine in a charming outdoor patio - so very delightful.', 1),
]

X_train, y_train = zip(*tweets)

And the following code to vectorize the documents.

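from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights, then turn each document into a TF-IDF row.
tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)

print(vectorized)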

When I print vectorized, it does not output a normal matrix. Instead, it prints a long list of lines, each with a pair of integers followed by a decimal value.

If I'm not wrong, this must be a sparse matrix representation. However, I am not able to comprehend its format, and what each term means.

Also, there are 30 documents, which would explain the 0-29 in the first column. If that's the pattern, then I'm guessing the second column is the index of the word, and the last value is its TF-IDF score? It just struck me while I was typing my question, but kindly correct me if I'm wrong.

Could anyone with experience in this help me understand it better?


Solution

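  • Yes. In each printed line, the tuple gives the (row, column) position, and the number after it is the value stored at that position. So your reading is right: the row is the document index, the column is the index of the word in the vocabulary, and the value is its TF-IDF weight. The output simply lists the positions and values of the nonzero entries.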
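
If it helps to see this concretely, here is a minimal sketch of how you could map the column indices back to words and get the "normal" dense matrix. It assumes scikit-learn is installed and uses three of the sentences from the question as a stand-in corpus; the names docs, index_to_word and entries are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three short documents standing in for the full list of tweets (illustrative subset).
docs = [
    "Loved this place.",
    "This restaurant has great food",
    "This restaurant has terrible food",
]

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(docs)

# One row per document, one column per word in the learned vocabulary.
print(vectorized.shape)  # (3, 8)

# vocabulary_ maps word -> column index; invert it to go from column index -> word.
index_to_word = {idx: word for word, idx in tfidfvec.vocabulary_.items()}

# Convert to coordinate form to walk the nonzero (row, column, value) entries.
coo = vectorized.tocoo()
entries = [
    (int(row), index_to_word[int(col)], float(value))
    for row, col, value in zip(coo.row, coo.col, coo.data)
]
for doc_index, word, score in entries:
    print(doc_index, word, round(score, 3))

# Dense version with explicit zeros; fine for small data, memory-hungry for large corpora.
dense = vectorized.toarray()
print(dense)
```

Every cell that does not appear in the sparse printout is zero in the dense array, which is why the sparse form is preferred once the vocabulary gets large.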