r, frequency, ranking, frequency-distribution

R: element frequency and column names


I have a dataframe that has four columns A, B, C and D:

A    B    C    D
a    a    b    c
b    c    x    e
c    d    y    a
d              z
e
f

I would like to get the frequency of every element, together with the list of columns in which it appears, ordered by frequency rank. The output would look something like this:

  Ranking  frequency column 
a    1         3      A, B, D
c    1         3      A, B, D
b    2         2      A, C
d    2         2      A, B
e    2         2      A, D
f  .....

I would appreciate any help. Thank you!


Solution

  • Something like this maybe:

    Data

    df <- read.table(header=TRUE, text='A    B    C    D
    a    a    b    c
    b    c    x    e
    c    d    y    a
    d   NA    NA     z
    e  NA NA NA
    f NA NA NA', stringsAsFactors=FALSE)
    

    Solution

    #find unique elements across all columns
    elements <- unique(unlist(df, use.names=FALSE))
    
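    #use lapply to collect the info for each element
    df2 <- data.frame(do.call(rbind,
            lapply(elements, function(x) {
              #find the rows and columns where the element occurs
              a <- which(df == x, arr.ind=TRUE)
              #column names for those positions (unique, in case an
              #element occurs more than once in the same column)
              b <- unique(names(df)[a[,2]])
              #frequency = number of matching cells
              n <- nrow(a)
              #produce output (everything is coerced to character here)
              c(x, n, paste(b, collapse=','))
    })), stringsAsFactors=FALSE)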
    
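    #remove the row produced for NA (the empty cells)
    df2 <- na.omit(df2)
    #change column names
    colnames(df2) <- c('element','frequency', 'columns')
    #frequency is not numeric at this point; convert it so the sort is numeric
    df2$frequency <- as.numeric(as.character(df2$frequency))
    #order according to frequency
    df2 <- df2[order(df2$frequency, decreasing=TRUE),]
    #create the ranking column (tied frequencies share a rank)
    df2$ranking <- as.numeric(factor(df2$frequency,levels=unique(df2$frequency)))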
    

    Output:

    > df2
       element frequency columns ranking
    1        a         3   A,B,D       1
    3        c         3   A,B,D       1
    2        b         2     A,C       2
    4        d         2     A,B       2
    5        e         2     A,D       2
    6        f         1       A       3
    8        x         1       C       3
    9        y         1       C       3
    10       z         1       D       3
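
    The ranking line relies on the rows already being sorted by frequency: a factor whose levels follow the order of first appearance gives each distinct frequency the next integer code, so tied frequencies share a rank. A minimal sketch of that idea on a standalone vector (the vector freq is made up for illustration), with match() shown as an equivalent spelling:

```r
# Frequencies already sorted in decreasing order (illustrative values)
freq <- c(3, 3, 2, 2, 2, 1)

# Levels follow first appearance, so the integer codes are dense ranks
rank_factor <- as.numeric(factor(freq, levels = unique(freq)))

# Equivalent: position of each value among the distinct values
rank_match <- match(freq, unique(freq))

rank_factor
# [1] 1 1 2 2 2 3
```

    Note that this only gives a ranking if the vector is sorted first; on unsorted input the codes follow order of appearance, not size.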
    

    If you want the elements as row names and the ranking column first, you can also do:

    row.names(df2) <- df2$element
    df2$element <- NULL
    df2 <- df2[c('ranking','frequency','columns')]
    

    Output:

    > df2
      ranking frequency columns
    a       1         3   A,B,D
    c       1         3   A,B,D
    b       2         2     A,C
    d       2         2     A,B
    e       2         2     A,D
    f       3         1       A
    x       3         1       C
    y       3         1       C
    z       3         1       D
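
    As an alternative to looping over the elements, the same table can probably be built by reshaping the data to long format (one row per cell) and summarising with tapply(). This is only a sketch in base R, not the method used above; the names long, freq, cols and res are introduced here for illustration, and it assumes empty cells are NA as in the sample data:

```r
# Same sample data as above
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = 'A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA')

# Long format: one row per cell, holding the value and its column name
long <- data.frame(element = unlist(df, use.names = FALSE),
                   column  = rep(names(df), each = nrow(df)),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$element), ]

# Count occurrences and collect the distinct column names per element
freq <- tapply(long$column, long$element, length)
cols <- tapply(long$column, long$element,
               function(x) paste(unique(x), collapse = ","))

res <- data.frame(frequency = as.vector(freq),
                  columns   = as.vector(cols),
                  row.names = names(freq),
                  stringsAsFactors = FALSE)

# Sort by decreasing frequency, then assign dense ranks (ties share a rank)
res <- res[order(-res$frequency), ]
res$ranking <- match(res$frequency, unique(res$frequency))
res <- res[c("ranking", "frequency", "columns")]
res
```

    On the sample data this should give the same ranking, frequency and column lists as the table above. It also avoids the character coercion of the frequency column, since the counts stay numeric throughout.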