I am trying to do a cluster analysis of hurricanes data with hurricane names as rownames. There are multiple entries for each hurricane and the info is entered in the time column. I can't figure out how to tidy this, Thanks in advance for the help.
dput(head(hurr))
structure(list(date_index = c("AL011851", "18510625", "18510625",
"18510625", "18510625", "18510625"), time = c(" UNNAMED",
" 0000", " 0600", " 1200", " 1800", " 2100"), n_entries = c(" 14",
" ", " ", " ", " ", " L"), type = c("", " HU", " HU", " HU",
" HU", " HU"), wind = c(NA, 80L, 80L, 80L, 80L, 80L), Column5 = c("",
" 28.0N", " 28.0N", " 28.0N", " 28.1N", " 28.2N"), Column6 = c("",
" 94.8W", " 95.4W", " 96.0W", " 96.5W", " 96.8W")), row.names = c(NA,
6L), class = "data.frame")
The desired output should look like that:
type wind Column5 Column6
UNNAMED HU 80 28.0N 94.8W
UNNAMED HU 80 28.0N 95.4W
UNNAMED HU 80 28.0N 96.0W
UNNAMED HU 80 28.1N 96.5W
UNNAMED HU 80 28.2N 96.8W
library(dplyr)
quux %>%
mutate(across(where(is.character), trimws)) %>%
group_by(grp = cumsum(grepl("\\D", date_index))) %>%
reframe(date_index = date_index[-1], type = paste(time[1], type[-1]), time = time[-1], across(c(wind, Column5, Column6), ~ .[-1])) %>%
select(-grp) %>%
mutate(across(everything(), ~ type.convert(., as.is = TRUE)))
# # A tibble: 5 × 6
# date_index type time wind Column5 Column6
# <int> <chr> <int> <int> <chr> <chr>
# 1 18510625 UNNAMED HU 0 80 28.0N 94.8W
# 2 18510625 UNNAMED HU 600 80 28.0N 95.4W
# 3 18510625 UNNAMED HU 1200 80 28.0N 96.0W
# 4 18510625 UNNAMED HU 1800 80 28.1N 96.5W
# 5 18510625 UNNAMED HU 2100 80 28.2N 96.8W
date_index
and time
in case they have import and varygroup_by
the presence of non-numbers in the first column. If the rest of the date_index
naturally has characters other than 0-9
, then this logic will need to be revisited, or removed if my assumptions was incorrect.type.convert
can easily be removed or restricted if you need to (say) keep time
as strings.Data
quux <- structure(list(date_index = c("AL011851", "18510625", "18510625", "18510625", "18510625", "18510625"), time = c(" UNNAMED", " 0000", " 0600", " 1200", " 1800", " 2100"), n_entries = c(" 14", " ", " ", " ", " ", " L"), type = c("", " HU", " HU", " HU", " HU", " HU"), wind = c(NA, 80L, 80L, 80L, 80L, 80L), Column5 = c("", " 28.0N", " 28.0N", " 28.0N", " 28.1N", " 28.2N"), Column6 = c("", " 94.8W", " 95.4W", " 96.0W", " 96.5W", " 96.8W")), row.names = c(NA, 6L), class = "data.frame")