
How to generate a new column based on some other column after clustering the data?


I have a dataframe like this, with columns ["A", "B", "C", "D"]:

A --> Categorical feature with 2 values, say Yes or No
B --> Categorical feature with 10 unique values, like "AAXX-10", "BBYY-20", etc.
C --> A date-time field
D --> Text column with short free text describing whether a person was interested in the movie or not (basically their comments after coming out of the theatre)

Sample df:

A   | B       | C         | D
----|---------|-----------|---------------------------------------------------------------------------------------
Yes | AAXX-10 | 8/10/2018 | "Yes I liked the movie, it was great"
Yes | BBYY-20 | 8/10/2017 | "I liked the performance of the cast in the movie but as a whole, it was just average"
No  | AANN-88 | 8/10/2013 | "Never seen a ridiculous movie like this"

I have two questions here -

  1. I want to make a fifth column, say "Interest", based on column "D", with 4 categories: ["Liked", "Didn't like", "Average", "Cannot comment"]. How could I do that?

     (That is, the "Interest" value for each row should be derived from the text in "D".)

  2. Most of the columns are categorical or date-time, and one is text. How should I go about the feature engineering in this particular scenario so that the data can be fed to KMeans?

How do I get features out of column "D", which is a text feature?

Should I convert column "A" to binary 0s and 1s?

Should I apply one-hot encoding or label encoding to column "B"?

How can I make use of the date-time feature "C" in the clustering?

Things I have tried -

I preprocessed and engineered features for columns A (converted to binary), B (label encoding) and C (extracted year and month features from the dates), and ignored D because I did not know how to use it.

Based on this, I got clusters using kmeans.labels_, but those clusters are just numbers: 1, 2, 3, 4.

How can I actually map those to ["Liked", "Didn't like", "Average", "Cannot comment"]? How can I use the text column efficiently to make the clusters?

Just short answers to my query would do. I don't need any implementation.


Solution

  • To answer the second question first:

    A: can be turned into a binary 0/1 feature.
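
    For example, a minimal pandas sketch (assuming your dataframe is called df):

    # "Yes" -> 1, "No" -> 0; the comparison yields a boolean Series
    df["A_bin"] = (df["A"] == "Yes").astype(int)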

    B: what information can you get from a list of unique strings by encoding? After encoding you are left with either a set of 0/1 indicator columns (one-hot) or a list of monotonically increasing ints (label encoding).
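
    If you keep B anyway, a hedged one-hot sketch (the B_* column names are just illustrative):

    import pandas as pd

    # One 0/1 indicator column per unique code in B (10 columns here)
    df = pd.concat([df, pd.get_dummies(df["B"], prefix="B")], axis=1)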

    C: you might be better off transforming the dates to a Unix epoch timestamp, if the date range allows it; this lets you calculate distances between dates properly.
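
    A minimal sketch of that conversion with pandas (df as in the question):

    import pandas as pd

    # Parse the date strings, then express each as seconds since 1970-01-01
    df["C_epoch"] = pd.to_datetime(df["C"]).astype("int64") // 10**9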

    D: this is the bread and butter of the project. The processing pipeline is quite complex, but here is a short summary.

    A basic recipe includes but is not limited to:

    1. Text normalization:
      • convert to lower or upper case,
      • convert numbers into words or remove them,
      • remove punctuation, accent marks and other diacritics,
      • remove leading and trailing white space.
    2. Corpus tokenization (split each row into a list of single words):
      • remove stop words (a, the, ...); they are very common and carry little information.
    3. Stemming or lemmatization. These reduce words to a base form. Stemming is quite crude and can produce invalid words, but it is fast; lemmatization produces valid words based on a dictionary, but it is slower. Many more steps are possible (steps 1-3 are sketched below). Finally:

    n. Feature extraction with TF-IDF. This is a sort of encoding that gives each word an importance score: it increases the weight of a word when it appears many times in a document, and lowers its weight when the word is common across many documents.
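
    A minimal sketch of steps 1-3 using NLTK (the Porter stemmer and the column names are assumptions, not the only choice):

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("punkt")        # tokenizer model
    nltk.download("stopwords")    # stop word lists

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        text = text.lower()                                  # 1. normalize case
        text = re.sub(r"[^a-z\s]", " ", text)                # 1. drop digits/punctuation
        tokens = nltk.word_tokenize(text)                    # 2. tokenize
        tokens = [t for t in tokens if t not in stop_words]  # 2. remove stop words
        return " ".join(stemmer.stem(t) for t in tokens)     # 3. stem to a base form

    df["D_clean"] = df["D"].apply(preprocess)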

    Example for TF-IDF:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]

    # Learn the vocabulary and compute the TF-IDF weights in one step
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    # One column per vocabulary term (get_feature_names() was removed in
    # scikit-learn 1.2 in favour of get_feature_names_out())
    print(vectorizer.get_feature_names_out())

    # (n_documents, n_terms)
    print(X.shape)
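
    X here is a sparse document-term matrix: one row per document, one column per vocabulary term, with the TF-IDF score in each cell. It can be used directly as (part of) the feature matrix for clustering.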
    

    After these n steps, you get the answer to your first question; the output could look something like this:

    [image: example output dataframe, not reproduced here]

    You can find code for all of these steps here (with NLTK). You might not be allowed to use NLTK, however, in which case you will have a hard time doing all these steps.
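
    To close the loop on your first question: the ids in kmeans.labels_ are arbitrary, so mapping them to names like "Liked" is a manual step you do after inspecting each cluster. A hedged sketch, reusing X and vectorizer from the TF-IDF example above (in practice X would come from your preprocessed "D" column; the 4-cluster choice and the mapping dict are assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=4, random_state=0, n_init=10)
    cluster_ids = km.fit_predict(X)

    # Look at the highest-weighted terms in each centroid to decide what
    # each cluster "means"
    terms = vectorizer.get_feature_names_out()
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]
    for i in range(km.n_clusters):
        print(i, [terms[j] for j in order[i, :5]])

    # Hypothetical mapping, chosen by hand after reading the output above
    mapping = {0: "Liked", 1: "Didn't like", 2: "Average", 3: "Cannot comment"}
    interest = [mapping[c] for c in cluster_ids]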