
How to generate a new column based on some other column after clustering the data?


I have a dataframe like this, with columns ["A", "B", "C", "D"]:

A --> Categorical feature with 2 values, say Yes or No
B --> Categorical feature with 10 unique values, like "AAXX-10", "BBYY-20", etc.
C --> A date-time field
D --> Text column with short free text describing whether a person was interested in the movie or not (basically their comments after coming out of the theatre)

Sample df:

A   | B       | C         | D
----|---------|-----------|---------------------------------------------------------------------------------------
Yes | AAXX-10 | 8/10/2018 | "Yes I liked the movie, it was great"
Yes | BBYY-20 | 8/10/2017 | "I liked the performance of the cast in the movie but as a whole, it was just average"
No  | AANN-88 | 8/10/2013 | "Never seen a ridiculous movie like this"

I have two questions here -

  1. I want to make a fifth column, say "Interest", based on column "D", with 4 categories: ["Liked", "Didn't like", "Average", "Cannot comment"]. How could I do that?

     (That is, the "Interest" value for each row should be derived from the text in "D".)

  2. Most of the columns are categorical or date-time, and one is text. How should I go about the feature engineering in this particular scenario so that the data can be fed to KMeans?

How do I get features out of column "D", which is a text feature?

Should I convert column "A" to binary 0s and 1s?

Should I apply one-hot encoding or label encoding to column "B"?

How can I make use of the date-time feature "C" in the clustering?

Things I have tried -

I preprocessed and engineered features for columns A (converted to binary), B (label encoding) and C (extracted year and month features from the dates), and ignored D because I did not know how to use it.

Based on this, I got clusters using kmeans.labels_, but those clusters are just numbers: 1, 2, 3, 4.

How can I actually map those to ["Liked", "Didn't like", "Average", "Cannot comment"]? How can I use the text column efficiently to make the clusters?

Just short answers to my query would do. I don't need any implementation.


Solution

  • To answer the second question first:

    A: can be turned into a binary 0/1 feature.
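
    For example, a minimal pandas sketch (assuming your dataframe is called df):

    # "Yes" -> 1, "No" -> 0; the comparison yields a boolean Series
    df["A_bin"] = (df["A"] == "Yes").astype(int)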

    B: what information can you get from a list of unique strings by encoding? After encoding you are left with either a set of 0/1 indicator columns (one-hot) or a list of monotonically increasing ints (label encoding).
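
    If you keep B anyway, a hedged one-hot sketch (the B_* column names are just illustrative):

    import pandas as pd

    # One 0/1 indicator column per unique code in B (10 columns here)
    df = pd.concat([df, pd.get_dummies(df["B"], prefix="B")], axis=1)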

    C: you might be better off transforming the dates to a Unix epoch timestamp, if the date range allows it; this lets you calculate distances between dates properly.
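
    A minimal sketch of that conversion with pandas (df as in the question):

    import pandas as pd

    # Parse the date strings, then express each as seconds since 1970-01-01
    df["C_epoch"] = pd.to_datetime(df["C"]).astype("int64") // 10**9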

    D: this is the bread and butter of the project. The processing pipeline is quite complex, but here is a short summary.

    A basic recipe includes but is not limited to:

    1. Text normalization:
      • convert to lower or upper case,
      • convert numbers into words or remove them,
      • remove punctuation, accent marks and other diacritics,
      • remove leading and trailing white space.
    2. Corpus tokenization (split each row into a list of single words):
      • remove stop words (a, the, ...); they are very common and carry little information.
    3. Stemming or lemmatization. These reduce words to a base form. Stemming is quite crude and can produce invalid words, but it is fast; lemmatization produces valid words based on a dictionary, but it is slower. Many more steps are possible (steps 1-3 are sketched below). Finally:

    n. Feature extraction with TF-IDF. This is a sort of encoding that gives each word an importance score: it increases the weight of a word when it appears many times in a document, and lowers its weight when the word is common across many documents.
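
    A minimal sketch of steps 1-3 using NLTK (the Porter stemmer and the column names are assumptions, not the only choice):

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("punkt")        # tokenizer model
    nltk.download("stopwords")    # stop word lists

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        text = text.lower()                                  # 1. normalize case
        text = re.sub(r"[^a-z\s]", " ", text)                # 1. drop digits/punctuation
        tokens = nltk.word_tokenize(text)                    # 2. tokenize
        tokens = [t for t in tokens if t not in stop_words]  # 2. remove stop words
        return " ".join(stemmer.stem(t) for t in tokens)     # 3. stem to a base form

    df["D_clean"] = df["D"].apply(preprocess)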

    Example for TF-IDF:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]

    # Learn the vocabulary and compute the TF-IDF weights in one step
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    # One column per vocabulary term (get_feature_names() was removed in
    # scikit-learn 1.2 in favour of get_feature_names_out())
    print(vectorizer.get_feature_names_out())

    # (n_documents, n_terms)
    print(X.shape)
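
    X here is a sparse document-term matrix: one row per document, one column per vocabulary term, with the TF-IDF score in each cell. It can be used directly as (part of) the feature matrix for clustering.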
    

    After these n steps, you get the answer to your first question; the output could look something like this:

    [image: example output dataframe, not reproduced here]

    You can find code for all of these steps here (with NLTK). You might not be allowed to use NLTK, however, in which case you will have a hard time doing all these steps.
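
    To close the loop on your first question: the ids in kmeans.labels_ are arbitrary, so mapping them to names like "Liked" is a manual step you do after inspecting each cluster. A hedged sketch, reusing X and vectorizer from the TF-IDF example above (in practice X would come from your preprocessed "D" column; the 4-cluster choice and the mapping dict are assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=4, random_state=0, n_init=10)
    cluster_ids = km.fit_predict(X)

    # Look at the highest-weighted terms in each centroid to decide what
    # each cluster "means"
    terms = vectorizer.get_feature_names_out()
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]
    for i in range(km.n_clusters):
        print(i, [terms[j] for j in order[i, :5]])

    # Hypothetical mapping, chosen by hand after reading the output above
    mapping = {0: "Liked", 1: "Didn't like", 2: "Average", 3: "Cannot comment"}
    interest = [mapping[c] for c in cluster_ids]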