Search code examples
raggregate

collapsing columns values in a specific order and leaving the missing values as NA in R


I am using R.

I have 4 different databases. Each one have values for my variables. Some of the bases have more values than others. So I want to use first the one that has the most values and lastly the one that have the least values. The data looks like this...

Variables   A   B   C   D

John        2   4   
Mike            6
Walter          7
Jennifer        9   8
Amanda      3
Carlos      9
Michael         3
James                   5
Kevin       4
Dennis              7
Frank
Steven
Joseph
Elvis           2
Maria           1

So, in roder to fill the data a need to create a new column that first uses the data of column B because is the one that contains the most values, then A, then C and then D and the ones that are missing need to be NA's. Also I need to add another column that gives me the reference of the data. In other words if I am using the column B to the that of John I need a column that tells me that the data pertains to column B.

The column should look like this...

Variables   E   D

John        4   B
Mike        6   B
Walter      7   B
Jennifer    9   B
Amanda      3   A
Carlos      9   A
Michael     3   B
James       5   D
Kevin       4   A
Dennis      7   C
Frank       NA  NA
Steven      NA  NA
Joseph      NA  NA
Elvis       2   B
Maria       1   B

Solution

  • With tidyverse you can do the following...

    Use pivot_longer to put into long form. Make name an ordered factor by "B", "A", "C", and "D". Then when you arrange, you can get the first value by this order within each person's name.

    This assumes your missing data are NA. If they are instead blank character values, you can filter those out with filter(value != "") instead of drop_na(value).

    library(tidyverse)
    
    df %>%
      pivot_longer(cols = -Variables) %>%
      mutate(name = ordered(name, levels = c('B', 'A', 'C', 'D'))) %>%
      group_by(Variables) %>%
      drop_na(value) %>%
      arrange(name) %>%
      summarise(E = first(value),
                New_D = first(name)) %>%
      right_join(df)
    

    Output

       Variables     E New_D     A     B     C     D
       <chr>     <dbl> <ord> <dbl> <dbl> <dbl> <dbl>
     1 Amanda        3 A         3    NA    NA    NA
     2 Carlos        9 A         9    NA    NA    NA
     3 Dennis        7 C        NA    NA     7    NA
     4 Elvis         2 B        NA     2    NA    NA
     5 James         5 D        NA    NA    NA     5
     6 Jennifer      9 B        NA     9     8    NA
     7 John          4 B         2     4    NA    NA
     8 Kevin         4 A         4    NA    NA    NA
     9 Maria         1 B        NA     1    NA    NA
    10 Michael       3 B        NA     3    NA    NA
    11 Mike          6 B        NA     6    NA    NA
    12 Walter        7 B        NA     7    NA    NA
    13 Frank        NA NA       NA    NA    NA    NA
    14 Steven       NA NA       NA    NA    NA    NA
    15 Joseph       NA NA       NA    NA    NA    NA
    

    Data

    df <- structure(list(Variables = c("John", "Mike", "Walter", "Jennifer", 
    "Amanda", "Carlos", "Michael", "James", "Kevin", "Dennis", "Frank", 
    "Steven", "Joseph", "Elvis", "Maria"), A = c(2, NA, NA, NA, 3, 
    9, NA, NA, 4, NA, NA, NA, NA, NA, NA), B = c(4, 6, 7, 9, NA, 
    NA, 3, NA, NA, NA, NA, NA, NA, 2, 1), C = c(NA, NA, NA, 8, NA, 
    NA, NA, NA, NA, 7, NA, NA, NA, NA, NA), D = c(NA, NA, NA, NA, 
    NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
    -15L))