Imagine the dataset:
df1 <- tibble::tribble(~City, ~Year, ~Coffee, ~Tea, ~Year, ~Sugar, ~At, ~Empty,
"NY", "2020", "", "", "2020", "2", "", "",
"NY", "2019", "5", "3", "2019", "5", "", "",
"City", "Year", "Coffee", "Tea", "Year", "Sugar", "", "",
"ATL", "2020", "", "", "2020", "2", "", "",
"ATL", "2019", "5", "3", "2019", "5", "", "",
"Data input by: Alex", "", "", "", "", "", "", "",
"BOS", "Year", "", "Coffee", "", "Tea", "Sugar", "",
"BOS", "2020", "", "7", "2020", "8", "3", "",
"BOS", "2019", "", "7", "2019", "7", "2", "",
"MS", "Year", "Frappacino", "Green Tea", "", "Coffee", "Sugar", "",
"MS", "2020", "5", "6", "2019", "8", "3", "",
"MS", "2019", "5", "5", "2020", "8", "3", "",
"City", "Year", "Coffee", "Tea", "Year", "Sugar", "At", "",
"HW", "2020", "500", "300", "2020", "200", "", "",
"HW", "2019", "450", "320", "2019", "180", "", "",
"Data input by: Aleksanteri", "", "", "", "", "", "", "",
"Kaupunki", "Vuosi", "Kahvi", "Tee", "Vuosi", "At", "Sokeri", "",
"HEL", "2020", "7", "4", "2018", "", "4", "",
"HEL", "2019", "7", "4", "2019", "", "4", "",
"HEL", "2018", "6", "3", "2020", "", "5", "")
The same dataset (for visual representation):
Problems:
Is there a simple method to deal with such data problems?
I have just faced this problem. I was solving it one by one, and then stopped with an idea there should be a simple method for doing everything.
A bit quick and dirty and you might need some finetuning if your data is even more variable than I assumed.
A key requirement in my approach is that your first row, at least City has the right name and that Year is named correctly in the first row and at known position (you could look it up by the names of your table. But years can not be skewed over the columns.
I use data.table
here
library(data.table)
setDT(df1)
# find the year columns
cols <- names(df1)
ypos <- last(which(cols %in% "Year"))
# split your data to deal with differences in years per row (for HEL)
dt <- rbindlist(list(df1[, 1L:(ypos-1L), with = F], df1[, c(1L, ypos:length(cols)), with = F]), fill = T)
# melt data and take the first row as headers we will fix later on
# this only works though when you know the Year column is never shifted to another column
dt <- melt.data.table(dt, id.vars = c("City", "Year"), variable.factor = F)
dt <- dt[!(Year == "" & value == ""),]
dt <- dt[, City := fifelse(City %in% c("City", "Kaupunki"), shift(City, type = "lead"), City)]
dt <- dt[!(value == "" | value == "At")]
v_f <- c("Kahvi", "Tee", "Sokeri")
v_t <- c("Coffee", "Tea", "Sugar")
dt[, value := str_replace_all(value,setNames(v_t, v_f))]
dt[, new_variable := first(value), rleid(City)]
dt[is.na(as.numeric(new_variable)), variable := new_variable][, new_variable := NULL]
dt[, value := as.numeric(value)]
dt <- dt[!is.na(value)]
dt[, value := fifelse(value > 100, value / 100, value)]
dcast(dt, City + Year ~ variable, value.var = "value")
Results
City Year Coffee Frappacino Green Tea Sugar Tea
1: ATL 2019 5.0 NA NA 5.0 3.0
2: ATL 2020 NA NA NA 2.0 NA
3: BOS 2019 7.0 NA NA 2.0 7.0
4: BOS 2020 7.0 NA NA 3.0 8.0
5: HEL 2018 6.0 NA NA 4.0 3.0
6: HEL 2019 7.0 NA NA 4.0 4.0
7: HEL 2020 7.0 NA NA 5.0 4.0
8: HW 2019 4.5 NA NA 1.8 3.2
9: HW 2020 5.0 NA NA 2.0 3.0
10: MS 2019 8.0 5 5 3.0 NA
11: MS 2020 8.0 5 6 3.0 NA
12: NY 2019 5.0 NA NA 5.0 3.0
13: NY 2020 NA NA NA 2.0 NA