Search code examples
rentropy

to calculate the Entropy


I am new to R and unable to calculate the entropy. There is a similar question on stackoverflow with the answer but i wanted to know why this code isn't working. Here is the copy paste data from the same question.

One of the answer mentions, "The part I think you are missing is the calculation of the class frequencies and you will get your answer", but how do i fix this. I tried most of the options but still i don't get any output. It just runs without any errors.

info <- function(CLASS.FREQ){
      freq.class <- CLASS.FREQ
      info <- 0
      for(i in 1:length(freq.class)){
        if(freq.class[[i]] != 0){ # zero check in class
          entropy <- -sum(freq.class[[i]] * log2(freq.class[[i]]))  #I calculate the entropy for each class i here
        }else{ 
          entropy <- 0
        } 
        info <- info + entropy # sum up entropy from all classes
      }
      return(info)
    }

Dataset as below,

buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")

credit <- c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")

student <- c("no", "no", "no","no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no")

income <- c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium")

age <- c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44) 

we change the age from categorical to numeric

Cheers, Jack


Solution

  • You need to calculate the propertion of "no" and "yes" in "buys", the proportion of "fair" and "excellent" in "credit", and so on. Here is one way to do it:

    data <- list(
      buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
      credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
      student = c("no", "no", "no","no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
      income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium"),
      age = c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44) 
      )
    
    freq <- lapply( data, function(x){rowMeans(outer(unique(x),x,"=="))})
    

    .

    > freq
    $buys
    [1] 0.3571429 0.6428571
    
    $credit
    [1] 0.5714286 0.4285714
    
    $student
    [1] 0.5 0.5
    
    $income
    [1] 0.2857143 0.4285714 0.2857143
    
    $age
     [1] 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857
    [14] 0.07142857
    

    Such a proportion can never be 0. So change if(freq.class[[i]] != 0){ # zero check in class to if(length(freq.class[[i]]) != 0){ # zero check in class:

    info <- function(CLASS.FREQ){
      freq.class <- CLASS.FREQ
      info <- 0
      for(i in 1:length(freq.class)){
        if(length(freq.class[[i]]) != 0){ # zero check in class
          entropy <- -sum(freq.class[[i]] * log2(freq.class[[i]]))  #I calculate the entropy for each class i here
        }else{ 
          entropy <- 0
        } 
        info <- info + entropy # sum up entropy from all classes
      }
      return(info)
    }
    

    .

    > info(freq)
    [1] 8.289526
    > info(freq$buys)
    [1] 0.940286
    > info(freq$age)
    [1] 3.807355
    >