Search code examples
rdataframedata-cleaningdata-munging

Of two rows eliminate the one with more NAs


I am looking for a way to check wether two columns in a data frame contain the same elements for one or more rows, then eliminate the row containing more NAs.

Lets assume we have a data frame as such:

x <- data.frame("Year" = c(2017,2017,2017,2018,2018),
            "Country" = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
            "Sales" = c(15, 15, 18, 13, 12),
            "Campaigns" = c(3, NA, 4, 1, 1),
            "Employees" = c(15, 15, 12, 8, 9),
            "Satisfaction" = c(0.8, NA, 0.9, 0.95, 0.87),
            "Expenses" = c(NA, NA, 9000, 7500, 4300))

Note that the entry for Sweden in the year 2017 is there twice, but the first row has one entry with NA while the other one contains NAs in three places. Now I would like to check wether two rows contain the same "Year" and "Country", then proceed to eliminate the row containing the higher amount of NAs, in this case the second row. I did some research but I could not seem to find a solution for this particular case.

Thank you very much in advance.


Solution

  • Using dplyr:

    library(dplyr)
    x %>%
      mutate(n_na = rowSums(is.na(.))) %>%  ## calculate NAs for each row      
      group_by(Year, Country) %>%           ## for each year/country
      arrange(n_na) %>%                       ## sort by number of NAs
      slice(1) %>%                            ## take the first row
      select(-n_na)                           ## remove the NA counter column
    # A tibble: 4 x 7
    # Groups:   Year, Country [4]
       Year Country Sales Campaigns Employees Satisfaction Expenses
      <dbl>  <fctr> <dbl>     <dbl>     <dbl>        <dbl>    <dbl>
    1  2017  Norway    18         4        12         0.90     9000
    2  2017  Sweden    15         3        15         0.80       NA
    3  2018 Denmark    13         1         8         0.95     7500
    4  2018 Finland    12         1         9         0.87     4300