Search code examples
rdataframefrequencypercentage

How can I find the frequency of characters in percentage given a big dataframe?


So, I have a big data frame which I imported from Excel to RStudio, its dimension is along the 2000x124. I'm able to filter this 2000 so that I work with smaller groups of 100sx124. On this dataframe each row is an individual and the columns show values attributed to each individual on each.

I have a few columns defining populations, and the rest of the columns are different values given to each individual. The values are either 0|0, 0|1, 1|0 or 1|1, and those span from column 4 to column 124. I need to find the percentage of the frequencies on each column. For example, if the table was as such:

ID Column A Column B
person1 0/0 0/0
person2 0/1 1/1
person3 0/0 1/1
person4 1/0 1/0
person5 1/1 1/1
person6 1/0 0/0
person7 0/0 0/0

I need to find the frequency in percentage on each column, both from the bigger values and the individual numbers on each. What I mean is, on column A there is 3 0|0, 2 1|0, 1 1|1 and 1 0|1. That gives me 42.8%, 28.5%, 14.3% and 14.3%. What I also need is the individual numbers frequency and percentage, 9 0s and 5 1s, roughly 64% and 36%.

I was able to use sumtable from vtable package as well as tbl_summary from gtsummary package to make a table with the values %, but I still couldn't find how to get those individual number frequency and percentages... I want to either add rows on the data frame with those percentages or to make a table that could count that. Could anyone help me?


Solution

  • It's easier if you pivot the data to make it long. Then you can group by the old column name and the value to count and calculate the relative frequencies. Then you can pivot wider again.

    library(tidyverse)
    
    df <- read_table("ID    Column_A    Column_B
    person1 0/0 0/0
    person2 0/1 1/1
    person3 0/0 1/1
    person4 1/0 1/0
    person5 1/1 1/1
    person6 1/0 0/0
    person7 0/0 0/0")
    
    df %>% 
        pivot_longer(cols = -ID, names_to = "Column", values_to = "Value") %>% 
        group_by(Column, Value) %>% 
        summarize(freq = n(), .groups = "drop_last") %>% 
        mutate(percent = freq/sum(freq)) %>% 
        pivot_wider(names_from = "Column", values_from = c("freq", "percent"), values_fill = list(freq = 0L, percent = 0))