Search code examples
rdataframemediandata-wrangling

R - Add values (derived by a formula) in a dataframe column based on a condition met by values in a column of another dataframe


Here is an example dataset:

data = data.frame('Cat' = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'),
                  'Value' = c(1,1,1,2,2,3,3,3,3,3))
data

enter image description here

Another dataframe:

a = data.frame('Name' = c('A', 'B', 'C', 'D'))

Desired output:

enter image description here

I want to understand how to give reference of another cell within the same row of a dataframe, and perform some function using the value of that cell.

This worked for "In Data":

a[,'In Data?'] = ifelse(a$Name %in% unique(data$Cat), "Y", "N")

This failed for median:

b$Median = median(data$Cat[data$Cat == a$Name])

Error message:
Error in Ops.factor(data$Cat, a$Name) : 
  level sets of factors are different

This failed for count:

a$Count = ifelse(a$Name %in% unique(data$Cat), length(data$Cat==a$Name), 0)

Error:
Error in Ops.factor(data$Cat, a$Name) : 
  level sets of factors are different

. . 2nd Dataframe columns :

  1. Cat : A B C D
  2. count :
  3. proportion :
  4. median :
  5. values > median :
  6. f(x) : {count + 10}
  7. In Data? :

Solution

  • It's better to frame these operations as merging and summarizing. (Talking in terms of cells and rows seem very Excel-like rather than R-like). The dplyr package helps a lot here

    library(dplyr)
    a %>% 
      left_join(data, by=c("Name"="Cat")) %>% 
      group_by(Name) %>% 
      summarize(
        Count=sum(!is.na(Value)),
        Median=median(Value),
        ValuesGtMed=sum(Value>Median),
        f = Count+10,
        InData = if_else(Count>0, "Y","N")
      ) %>% 
      mutate(Proportion=Count/sum(Count))
    

    The left_join makes sure we get all values in a and then we just use different summary functions per the groups defined by Name

    Output:

      Name  Count Median ValuesGtMed     f InData Proportion
      <chr> <int>  <dbl>       <int> <dbl> <chr>       <dbl>
    1 A         3      1           0    13 Y             0.3
    2 B         2      2           0    12 Y             0.2
    3 C         5      3           0    15 Y             0.5
    4 D         0     NA          NA    10 N             0