Tags: r, matrix, probability, apply, frequency-distribution

How to convert frequency distribution to probability distribution in R


I have a matrix with n rows of observations. Each row is a frequency distribution over the features. I would like to transform each frequency distribution into a probability distribution, so that every row sums to 1. In other words, each element of the matrix should be divided by the sum of its row.

I wrote the following R function that does the work but it is very slow with large matrices:

prob_dist <- function(x) {

    row_prob_dist <- function(row) {
       return (t(lapply(row, function(x,y=sum(row)) x/y)))
       }

    for (i in 1:nrow(x)) {
       if (i==1) p_dist <- row_prob_dist(x[i,])
       else p_dist <- rbind(p_dist, row_prob_dist(x[i,]))
       }
    return(p_dist)
}

B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
B
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7

prob_dist(B)
     [,1]      [,2]    
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3       0.7     

Could you suggest an R function that does the job, and/or tell me how I can optimise my function to run faster?


Solution

  • Here's an attempt, but on a data frame instead of a matrix:

    df <- data.frame(replicate(100, sample(1:10, 10e4, replace=TRUE)))
    

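    I tried a dplyr approach (mutate_each() and funs() have since been deprecated in dplyr; across() is the current replacement):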

    library(dplyr)
    df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
    

    Here are the results:

    library(microbenchmark) 
    mbm = microbenchmark(
    dplyr = df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
    t = t(t(df) / rep(rowSums(df), each=ncol(df))),
    apply = t(apply(df, 1, prop.table)),
    times = 100
    )
    


    #> mbm
    #Unit: milliseconds
    #  expr       min        lq      mean    median        uq       max neval
    # dplyr  123.1894  124.1664  137.7076  127.3376  131.1523  445.8857   100
    #     t  384.6002  390.2353  415.6141  394.8121  408.6669  787.2694   100
    # apply 1425.0576 1520.7925 1646.0082 1599.1109 1734.3689 2196.5003   100
    

    Edit: @David's benchmark is closer to the OP's setup, so I suggest you consider his approach if you are working with matrices.
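
For the matrix case in the question, a minimal base-R sketch is shown below. It relies on R's column-major recycling: dividing a matrix by a vector whose length equals the number of rows divides each row by its own entry of that vector. This is only an illustration, not necessarily the approach referred to in the edit above (which is not reproduced here), and the names p1, p2, p3 and p_ref are made up for the example.

```r
# Example matrix from the question
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)

# 1. Vectorised division: rowSums(B) has one entry per row and is recycled
#    down each column, so element [i, j] is divided by the sum of row i.
p1 <- B / rowSums(B)

# 2. prop.table() with margin = 1 normalises by row
p2 <- prop.table(B, 1)

# 3. sweep() states the intent explicitly: divide along margin 1 (rows)
p3 <- sweep(B, 1, rowSums(B), "/")

# Reference: the apply-based expression from the benchmark above
p_ref <- t(apply(B, 1, prop.table))

# All variants agree, and every row sums to 1
stopifnot(isTRUE(all.equal(p1, p2)),
          isTRUE(all.equal(p1, p3)),
          isTRUE(all.equal(p1, p_ref)),
          isTRUE(all.equal(rowSums(p1), rep(1, nrow(B)))))

print(p1)
```

None of these variants loops over rows in R code or grows the result with rbind(), which is the likely bottleneck in the original function. Rows whose sum is 0 come out as NaN with any of them, so they may need separate handling.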