r cluster-analysis data-mining

How to run cluster analysis in R on text-based data


The Excel data contains 36 factors (basically yes/no questions) collected from users. Is there any way to run a cluster analysis on these answers? I tried using the iris example as a reference, but since my data is entirely text-based, I am still trying to figure out how.

The data would look like this:

            Q 1     Q 2     Q 3     Q 4     Q 5
People 1    Yes     Yes     Yes     Yes     Yes 
People 2    No      Yes     No      Yes     No
People 3    No      No      No      No      No
People 4    Yes     No      Yes     No      Yes 
People 5    No      Yes     No      Yes     No
People 6    Yes     No      Yes     No      Yes 
People 7    No      Yes     No      Yes     No

Solution

  • For the factor analysis itself I refer you to online blogs, Cross Validated (Stack Exchange), or other resources; here I only show one way to make your data numeric.

    Here is how I reproduced your data:

    library(tidyverse)
    df <- read_table("Person ID     Q1     Q2     Q3     Q4     Q5
    People 1    Yes     Yes     Yes     Yes     Yes 
    People 2    No      Yes     No      Yes     No
    People 3    No      No      No      No      No
    People 4    Yes     No      Yes     No      Yes 
    People 5    No      Yes     No      Yes     No
    People 6    Yes     No      Yes     No      Yes 
    People 7    No      Yes     No      Yes     No") %>% 
      unite("PersonID", Person, ID, sep = "")
    

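    Now you need to convert the text to factors and then to numeric data. The result is assigned back to df so that later steps see the numeric columns.

    # Convert each question column to a factor, then to its integer level code
    # (levels sort alphabetically, so "No" becomes 1 and "Yes" becomes 2)
    df <- df %>% 
      mutate(across(starts_with("Q"), ~ as.numeric(as.factor(.x))))
    df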
    

    Output is:

    # A tibble: 7 x 6
      PersonID    Q1    Q2    Q3    Q4    Q5
      <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
    1 People1      2     2     2     2     2
    2 People2      1     2     1     2     1
    3 People3      1     1     1     1     1
    4 People4      2     1     2     1     2
    5 People5      1     2     1     2     1
    6 People6      2     1     2     1     2
    7 People7      1     2     1     2     1
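
    Because the factor codes depend on the alphabetical order of the levels, an explicit 0/1 coding can be easier to read. The following is only a sketch of that idea, not part of the original approach: the names survey and survey01 are made up for illustration, and it assumes every answer is exactly "Yes" or "No".

    ```r
    library(dplyr)

    # Hypothetical toy data mirroring the first three respondents and questions
    survey <- tibble::tribble(
      ~PersonID, ~Q1,   ~Q2,   ~Q3,
      "People1", "Yes", "Yes", "Yes",
      "People2", "No",  "Yes", "No",
      "People3", "No",  "No",  "No"
    )

    # Explicit coding: "Yes" becomes 1 and "No" becomes 0
    # (assumption: no other values or missing answers are present)
    survey01 <- survey %>%
      mutate(across(starts_with("Q"), ~ as.integer(.x == "Yes")))
    ```

    With this coding a column mean is simply the share of "Yes" answers for that question.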
    

    Now you can compute a correlation matrix, which you might need for your factor analysis:

    # Drop the ID column, then compute pairwise correlations between the questions
    df %>% 
      select(-PersonID) %>% 
      cor()
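
    Since the question asks about clustering, here is a minimal sketch of one possible way to cluster respondents once the answers are numeric. It is an assumption on my part, not the only option: it uses base R hierarchical clustering (dist, hclust, cutree) with Manhattan distance, which on 0/1 data counts the number of questions two people answered differently. The matrix answers re-types the example data with 1 = Yes and 0 = No, and k = 2 is an arbitrary choice for illustration.

    ```r
    # Example answers for the seven respondents (1 = Yes, 0 = No)
    answers <- matrix(
      c(1, 1, 1, 1, 1,
        0, 1, 0, 1, 0,
        0, 0, 0, 0, 0,
        1, 0, 1, 0, 1,
        0, 1, 0, 1, 0,
        1, 0, 1, 0, 1,
        0, 1, 0, 1, 0),
      nrow = 7, byrow = TRUE,
      dimnames = list(paste0("People", 1:7), paste0("Q", 1:5))
    )

    # Manhattan distance on 0/1 data = number of differing answers per pair
    d <- dist(answers, method = "manhattan")

    # Average-linkage hierarchical clustering on those distances
    hc <- hclust(d, method = "average")

    # Cut the tree into k groups (k = 2 is an arbitrary illustrative choice)
    clusters <- cutree(hc, k = 2)
    clusters
    ```

    Calling plot(hc) draws the dendrogram, which can help when deciding how many groups to cut the tree into for the full 36-question data.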
    

    Hope this approach helps.