Search code examples
rrstudio

Filling missing column data based on other column data in R


I have a dataset which is filled with values for certain rows but then missing for others. I want to fill the missing data based on the values from other columns.

df=

        OFFENSE   GROUP                  DESCRIPTION              UCR
0       3126      NaN                    ASSAULT                  NaN
1       3831      NaN                    PROPERTY DAMAGE          NaN
2       724       NaN                    AUTO THEFT               NaN
3       301       NaN                    ROBBERY                  NaN
4       619       NaN                    LARCENY ALL OTHERS       NaN
245865  3115      Aggravated Assault     ASSAULT                  Part One
245866  619       Larceny                LARCENY ALL OTHERS       Part One
245867  2629      Harassment             HARASSMENT               Part Two
245868  2629      Harassment             HARASSMENT               Part Two
245869  3208      Property Lost          PROPERTY - MISSING       Part Three

For example, LARCENY ALL OTHERS should have GROUP as "Larceny" as well as the UCR "Part One". How do I fill these values with their correct values based on the filled columns?

I'm personally only interested in the ones with UCR "Part One", so I have tried filtering out the rows with UCR == 'Part One' and then made a list of the unique values in the DESCRIPTION column. I was then hoping to use a function to fill the other data based on if the values are in that list. But I cant think of what function to do.

dftest <- df[df$UCR == 'Part One',]
offenseslist <- unique(dftest$offence_name)

df %>% mutate(UCR =
                     case_when(df3$offence_name = offenseslist ~ "Part one")

Solution

  • You can replace the string 'NaN' with NA using NA_if(), then sort (arrange) the data by the desired columns so that NA values per GROUP and UCR come last and finally fillNA with the values one row above.

    Example data df:

    df <- structure(list(ID = c(0L, 1L, 2L, 3L, 4L, 245865L, 245866L, 245867L, 
    245868L, 245869L), OFFENSE = c(3126L, 3831L, 724L, 301L, 619L, 
    3115L, 619L, 2629L, 2629L, 3208L), GROUP = c("NaN", "NaN", "NaN", 
    "NaN", "NaN", "Aggravated Assault", "Larceny", "Harassment", 
    "Harassment", "Property Lost"), DESCRIPTION = c("ASSAULT", "PROPERTY DAMAGE", 
    "AUTO THEFT", "ROBBERY", "LARCENY ALL OTHERS", "ASSAULT", "LARCENY ALL OTHERS", 
    "HARASSMENT", "HARASSMENT", "PROPERTY - MISSING"), UCR = c("NaN", 
    "NaN", "NaN", "NaN", "NaN", "Part One", "Part One", "Part Two", 
    "Part Two", NA)), class = "data.frame", row.names = c(NA, 10L
    ))
    

    code:

    library(tidyr)
    library(dplyr)
    
    df %>%
      na_if('NaN') %>%
      arrange(DESCRIPTION, GROUP, UCR) %>%
      fill(GROUP, UCR, .direction = 'down')
    

    Note that fill only targets NA, hence the initial replacement of 'NaN' with NA.