I have a dataframe that looks kind of looks like this:
y<-c("A1","B1", "C2", "A1", "B1","C1", "A1","B2", "C3", "A1", "B1", "C4", "A1", "B1","C4", "A1","B2", "C4", "A1","B1", "C4", "A1", "B1", "C4")
test<- data.frame(matrix(y, nrow = 3, ncol = 8))
colnames(test) <- c("Learn_1", "Car_1", "Car_2", "Fan_1", "Fan_2", "Fan_3","Kart_1", "God_1")
test
Learn_1 Car_1 Car_2 Fan_1 Fan_2 Fan_3 Kart_1 God_1
1 A1 A1 A1 A1 A1 A1 A1 A1
2 B1 B1 B2 B1 B1 B2 B1 B1
3 C2 C1 C3 C4 C4 C4 C4 C4
My real data has 13 columns of unequal length and thousands of rows and the values are mixed up. I want to determine the frequency of each value in God_1 to all the other columns, but for each column with the same word (means columns are from same study) (i.e. columns Fan and Car count the value frequency as 1 if that value shows up more than once within those columns. Then I want to plot the percent of values that show up 5,4,3,2,1 over the total percent (100%) of values available in GOD_1. I am thinking a box the shows the total number of values, then different percent shades that discriminate the frequency values (1,2,3,4,5). My plot should have a min of 1 and a max of 5 (there are 5 unique column words).
My problem, is I have no clue how to start this, but thinking about this the last few days. Ideas anyone?
How many times these frequency show up based on what I want:
A1 = 5
B1 = 5
C4 = 3
here is the str of my example and my real data looks like this but has 2366 obs. of 13 variables, various Factor w/ some number of levels (range from 200 : 3000)
str(test)
'data.frame': 3 obs. of 8 variables:
$ Learn_1: Factor w/ 3 levels "A1","B1","C2": 1 2 3
$ Car_1 : Factor w/ 3 levels "A1","B1","C1": 1 2 3
$ Car_2 : Factor w/ 3 levels "A1","B2","C3": 1 2 3
$ Fan_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
$ Fan_2 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
$ Fan_3 : Factor w/ 3 levels "A1","B2","C4": 1 2 3
$ Kart_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
$ God_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
We can use dplyr
and tidyr
.
First the data is gather
ed to wide format, then we separate
the numeric part from the labels, use distinct
to remove duplicates, count all the occurances, and use left_join to only look at those that were in the God_1 column.
library(dplyr)
library(tidyr)
dat %>% gather(key, val) %>%
separate(key, c("id", "num")) %>%
distinct(id, val) %>%
count(val) %>%
left_join(dat["God_1"], ., by = c(God_1 = "val"))
Source: local data frame [3 x 2]
God_1 out
(fctr) (dbl)
1 A1 5
2 B1 5
3 C4 3