I have a dataframe with columns ["A", "B", "C", "D"]:
A --> Categorical feature with 2 values, say "Yes" or "No"
B --> Categorical feature with 10 unique values, like "AAXX-10", "BBYY-20", etc.
C --> A date-time field
D --> Text-based column describing whether a person was interested in the movie or not, based on a short text (basically their comments after coming out of the theatre)
Sample df:

A   | B       | C         | D
----|---------|-----------|------------------------------------------
Yes | AAXX-10 | 8/10/2018 | "Yes I liked the movie, it was great"
Yes | BBYY-20 | 8/10/2017 | "I liked the performance of the cast in the movie but as a whole, it was just average"
No  | AANN-88 | 8/10/2013 | "Never seen a ridiculous movie like this"
I have two questions here:

1. I want to create an "Interest" column whose values come from ["Liked", "Didn't like", "Average", "Cannot comment"]. How could I do that? (On the basis of "D", the "Interest" column should end up with values like ["Liked", "Average", "Didn't like"].)
2. How do I get features out of column "D", which is a text feature?
Should I convert column "A" to binary 0s and 1s?
Should I do one-hot encoding or label encoding for the second column?
How can I make use of the date-time feature in the clustering?
Things I have tried:

I did preprocessing and feature engineering on column A (converted to binary), B (label encoding), C (converted the dates to year and month features), and D (ignored, as I did not know how I could use it).

Based on this, I got clusters using kmeans.labels_, but those clusters are numeric (1, 2, 3, 4). How can I actually map them to ["Liked", "Didn't like", "Average", "Cannot comment"]?
How can I use the text column efficiently to make the clusters?
Just short answers to my query would do. I don't need any implementation.
To answer the second question first:
A: can be turned into binary (0/1).
B: what information can you get from a list of unique strings by encoding? After encoding you are left with either the identity matrix (one-hot encoding) or a list of monotonically increasing ints (label encoding).
C: you might be better off transforming the dates to Unix epoch timestamps, if the date range allows it; this lets you calculate distances properly.
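A minimal sketch of those three encodings, assuming pandas (the derived column names like A_bin and C_epoch are just for illustration):

import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "A": ["Yes", "Yes", "No"],
    "B": ["AAXX-10", "BBYY-20", "AANN-88"],
    "C": ["8/10/2018", "8/10/2017", "8/10/2013"],
})

# A: two categories -> 0/1
df["A_bin"] = (df["A"] == "Yes").astype(int)

# B: one-hot encode the ~10 unique codes
df = pd.concat([df, pd.get_dummies(df["B"], prefix="B")], axis=1)

# C: parse the dates, then convert to Unix epoch seconds so that
# numeric distances between rows are meaningful
df["C_epoch"] = pd.to_datetime(df["C"], format="%m/%d/%Y").astype("int64") // 10**9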
D: This is the bread and butter of the project. The processing step is very complex, but in short, a basic recipe includes (and is not limited to) the following steps:

- lowercase the text and tokenize it
- remove stopwords and punctuation
- stem or lemmatize the tokens
- vectorize (bag-of-words, tf-idf, ...)
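A short sketch of the first few steps, assuming NLTK is available (resource names can vary between NLTK versions):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer model
nltk.download("stopwords")  # stopword list

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # lowercase, tokenize, keep alphabetic non-stopword tokens, stem
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("Yes I liked the movie, it was great"))
# -> something like ['yes', 'like', 'movi', 'great']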
Example for tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Learn the vocabulary and turn the corpus into a documents x terms tf-idf matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(X.shape)
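For reference, on a current scikit-learn this prints something like:

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
(4, 9)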
After these steps, you get the answer to your first question; the end result could look something like this:
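For instance, reusing the sample rows from the question (the assigned labels are purely illustrative):

A   | B       | C         | D                                         | Interest
----|---------|-----------|-------------------------------------------|------------
Yes | AAXX-10 | 8/10/2018 | "Yes I liked the movie, it was great"     | Liked
No  | AANN-88 | 8/10/2013 | "Never seen a ridiculous movie like this" | Didn't like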
You can find code showing how to do all of this here (with NLTK). You might not be allowed to use NLTK, however, in which case you will have a hard time doing all these steps.
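In case it helps later, here is a minimal end-to-end sketch, assuming scikit-learn and the comments from the question: cluster the tf-idf vectors with KMeans, inspect the heaviest terms per cluster, and then assign the text labels yourself (the numeric-to-text mapping is a manual judgment call, not something KMeans can give you):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "Yes I liked the movie, it was great",
    "I liked the performance of the cast in the movie but as a whole, it was just average",
    "Never seen a ridiculous movie like this",
]

# Vectorize the comments and cluster the tf-idf vectors
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(comments)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Print the heaviest terms of each cluster centre; reading these is how
# you decide which human label each numeric cluster deserves
terms = vectorizer.get_feature_names_out()
for i, centre in enumerate(kmeans.cluster_centers_):
    top = np.argsort(centre)[::-1][:3]
    print(f"cluster {i}:", [terms[j] for j in top])

# Hypothetical assignment after inspecting the clusters by hand
label_map = {0: "Liked", 1: "Average", 2: "Didn't like"}
interest = [label_map[c] for c in kmeans.labels_]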