Search code examples
rdata-cleaningdata-mining

Easiest method to clean complex data


Imagine the dataset:

df1 <- tibble::tribble(~City,   ~Year,  ~Coffee,    ~Tea,   ~Year,  ~Sugar, ~At,    ~Empty,
"NY",   "2020", "", "", "2020", "2",    "", "",
"NY",   "2019", "5",    "3",    "2019", "5",    "", "",
"City", "Year", "Coffee",   "Tea",  "Year", "Sugar",    "", "",
"ATL",  "2020", "", "", "2020", "2",    "", "",
"ATL",  "2019", "5",    "3",    "2019", "5",    "", "",
"Data input by: Alex",  "", "", "", "", "", "", "",
"BOS",  "Year", "", "Coffee",   "", "Tea",  "Sugar",    "",
"BOS",  "2020", "", "7",    "2020", "8",    "3",    "",
"BOS",  "2019", "", "7",    "2019", "7",    "2",    "",
"MS",   "Year", "Frappacino",   "Green Tea",    "", "Coffee",   "Sugar",    "",
"MS",   "2020", "5",    "6",    "2019", "8",    "3",    "",
"MS",   "2019", "5",    "5",    "2020", "8",    "3",    "",
"City", "Year", "Coffee",   "Tea",  "Year", "Sugar",    "At",   "",
"HW",   "2020", "500",  "300",  "2020", "200",  "", "",
"HW",   "2019", "450",  "320",  "2019", "180",  "", "",
"Data input by: Aleksanteri",   "", "", "", "", "", "", "",
"Kaupunki", "Vuosi",    "Kahvi",    "Tee",  "Vuosi",    "At",   "Sokeri",   "",
"HEL",  "2020", "7",    "4",    "2018", "", "4",    "",
"HEL",  "2019", "7",    "4",    "2019", "", "4",    "",
"HEL",  "2018", "6",    "3",    "2020", "", "5",    "")

The same dataset (for visual representation): enter image description here

Problems:

  1. City ATL is error (it is a copy of NY) [but we can not to know if such pattern exist]
  2. There are two people inputting the data to the original database (red rows, first column, but can be anywhere within the row)
  3. BOS is shifted by one cell starting from year; the second name "Year" is missing
  4. MS has Frappacino and Green Tea but no Tea data :(
  5. MS has coffee data flip around by years (2020 is 2019 and 2019 is 2020)
  6. HW has wrong currency input (too different numbers); the proper currency is X/100 (500 HW is 5 (500/100))
  7. HEL has Finnish names of the columns in the input (and we would like to control the data is not shifted); so that we know: Vuosi = Year; Kaupunki = City; Kahvi = Coffee; Tee = Tea; Sokeri = Sugar :)
  8. HEL has a flip-around in years for Sokeri (Sugar) compared to Kahvi (Coffee), and Tee (Tea).
  • Empty column is fully empty (it can be place for potential shifts in data [not our case].

Is there a simple method to deal with such data problems?

I have just faced this problem. I was solving it one by one, and then stopped with an idea there should be a simple method for doing everything.


Solution

  • A bit quick and dirty and you might need some finetuning if your data is even more variable than I assumed.

    A key requirement in my approach is that your first row, at least City has the right name and that Year is named correctly in the first row and at known position (you could look it up by the names of your table. But years can not be skewed over the columns.

    I use data.table here

    library(data.table)
    
    setDT(df1)
    
    # find the year columns
    cols <- names(df1)
    ypos <- last(which(cols %in% "Year"))
    
    # split your data to deal with differences in years per row (for HEL)
    dt <- rbindlist(list(df1[, 1L:(ypos-1L), with = F], df1[, c(1L, ypos:length(cols)), with = F]), fill = T)
    
    # melt data and take the first row as headers we will fix later on
    # this only works though when you know the Year column is never shifted to another column
    dt <- melt.data.table(dt, id.vars = c("City", "Year"), variable.factor = F)
    
    dt <- dt[!(Year == "" & value == ""),]
    dt <- dt[, City := fifelse(City %in% c("City", "Kaupunki"), shift(City, type = "lead"), City)]
    dt <- dt[!(value == "" | value == "At")]
    
    v_f <- c("Kahvi", "Tee", "Sokeri")
    v_t <- c("Coffee", "Tea", "Sugar")
    dt[, value := str_replace_all(value,setNames(v_t, v_f))]
    
    dt[, new_variable := first(value), rleid(City)]
    dt[is.na(as.numeric(new_variable)), variable := new_variable][, new_variable := NULL]
    dt[, value := as.numeric(value)]
    dt <- dt[!is.na(value)]
    dt[, value := fifelse(value > 100, value / 100, value)]
    
    dcast(dt, City + Year ~ variable, value.var = "value")
    

    Results

        City Year Coffee Frappacino Green Tea Sugar Tea
     1:  ATL 2019    5.0         NA        NA   5.0 3.0
     2:  ATL 2020     NA         NA        NA   2.0  NA
     3:  BOS 2019    7.0         NA        NA   2.0 7.0
     4:  BOS 2020    7.0         NA        NA   3.0 8.0
     5:  HEL 2018    6.0         NA        NA   4.0 3.0
     6:  HEL 2019    7.0         NA        NA   4.0 4.0
     7:  HEL 2020    7.0         NA        NA   5.0 4.0
     8:   HW 2019    4.5         NA        NA   1.8 3.2
     9:   HW 2020    5.0         NA        NA   2.0 3.0
    10:   MS 2019    8.0          5         5   3.0  NA
    11:   MS 2020    8.0          5         6   3.0  NA
    12:   NY 2019    5.0         NA        NA   5.0 3.0
    13:   NY 2020     NA         NA        NA   2.0  NA