Search code examples
rplotdataframefrequency-analysis

How to loop through all columns and compare to a specific column and plot frequency read outs


I have a dataframe that looks kind of looks like this:

 y<-c("A1","B1", "C2", "A1", "B1","C1", "A1","B2", "C3", "A1", "B1", "C4", "A1", "B1","C4", "A1","B2", "C4", "A1","B1", "C4", "A1", "B1", "C4")
 test<- data.frame(matrix(y, nrow = 3, ncol = 8))
 colnames(test) <- c("Learn_1", "Car_1", "Car_2", "Fan_1", "Fan_2", "Fan_3","Kart_1", "God_1")
 test

 Learn_1 Car_1 Car_2 Fan_1 Fan_2 Fan_3 Kart_1 God_1
 1      A1    A1    A1    A1    A1    A1     A1    A1
 2      B1    B1    B2    B1    B1    B2     B1    B1
 3      C2    C1    C3    C4    C4    C4     C4    C4

My real data has 13 columns of unequal length and thousands of rows and the values are mixed up. I want to determine the frequency of each value in God_1 to all the other columns, but for each column with the same word (means columns are from same study) (i.e. columns Fan and Car count the value frequency as 1 if that value shows up more than once within those columns. Then I want to plot the percent of values that show up 5,4,3,2,1 over the total percent (100%) of values available in GOD_1. I am thinking a box the shows the total number of values, then different percent shades that discriminate the frequency values (1,2,3,4,5). My plot should have a min of 1 and a max of 5 (there are 5 unique column words).

My problem, is I have no clue how to start this, but thinking about this the last few days. Ideas anyone?

How many times these frequency show up based on what I want:

A1 = 5
B1 = 5
C4 = 3

here is the str of my example and my real data looks like this but has 2366 obs. of 13 variables, various Factor w/ some number of levels (range from 200 : 3000)

str(test)
'data.frame':   3 obs. of  8 variables:
 $ Learn_1: Factor w/ 3 levels "A1","B1","C2": 1 2 3
 $ Car_1  : Factor w/ 3 levels "A1","B1","C1": 1 2 3
 $ Car_2  : Factor w/ 3 levels "A1","B2","C3": 1 2 3
 $ Fan_1  : Factor w/ 3 levels "A1","B1","C4": 1 2 3
 $ Fan_2  : Factor w/ 3 levels "A1","B1","C4": 1 2 3
 $ Fan_3  : Factor w/ 3 levels "A1","B2","C4": 1 2 3
 $ Kart_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
 $ God_1  : Factor w/ 3 levels "A1","B1","C4": 1 2 3

Solution

  • We can use dplyr and tidyr.

    First the data is gathered to wide format, then we separate the numeric part from the labels, use distinct to remove duplicates, count all the occurances, and use left_join to only look at those that were in the God_1 column.

    library(dplyr)
    library(tidyr)
    dat %>% gather(key, val) %>%
            separate(key, c("id", "num")) %>% 
            distinct(id, val) %>%
            count(val) %>%
            left_join(dat["God_1"], ., by = c(God_1 = "val"))
    
    
    
    Source: local data frame [3 x 2]
    
       God_1   out
      (fctr) (dbl)
    1     A1     5
    2     B1     5
    3     C4     3