Search code examples
rdplyrtidyr

How to tidy this messy hurricanes dataset


I am trying to do a cluster analysis of hurricanes data with hurricane names as rownames. There are multiple entries for each hurricane and the info is entered in the time column. I can't figure out how to tidy this, Thanks in advance for the help.

    dput(head(hurr))
structure(list(date_index = c("AL011851", "18510625", "18510625", 
"18510625", "18510625", "18510625"), time = c("            UNNAMED", 
" 0000", " 0600", " 1200", " 1800", " 2100"), n_entries = c("     14", 
"  ", "  ", "  ", "  ", " L"), type = c("", " HU", " HU", " HU", 
" HU", " HU"), wind = c(NA, 80L, 80L, 80L, 80L, 80L), Column5 = c("", 
" 28.0N", " 28.0N", " 28.0N", " 28.1N", " 28.2N"), Column6 = c("", 
"  94.8W", "  95.4W", "  96.0W", "  96.5W", "  96.8W")), row.names = c(NA, 
6L), class = "data.frame")

The desired output should look like that:

       type      wind   Column5     Column6 
UNNAMED HU         80   28.0N       94.8W   
UNNAMED HU         80   28.0N       95.4W   
UNNAMED HU         80   28.0N       96.0W   
UNNAMED HU         80   28.1N       96.5W   
UNNAMED HU         80   28.2N       96.8W   

Solution

  • library(dplyr)
    quux %>%
      mutate(across(where(is.character), trimws)) %>%
      group_by(grp = cumsum(grepl("\\D", date_index))) %>%
      reframe(date_index = date_index[-1], type = paste(time[1], type[-1]), time = time[-1], across(c(wind, Column5, Column6), ~ .[-1])) %>%
      select(-grp) %>%
      mutate(across(everything(), ~ type.convert(., as.is = TRUE)))
    # # A tibble: 5 × 6
    #   date_index type        time  wind Column5 Column6
    #        <int> <chr>      <int> <int> <chr>   <chr>  
    # 1   18510625 UNNAMED HU     0    80 28.0N   94.8W  
    # 2   18510625 UNNAMED HU   600    80 28.0N   95.4W  
    # 3   18510625 UNNAMED HU  1200    80 28.0N   96.0W  
    # 4   18510625 UNNAMED HU  1800    80 28.1N   96.5W  
    # 5   18510625 UNNAMED HU  2100    80 28.2N   96.8W  
    
    • I kept date_index and time in case they have import and vary
    • I assumed that the first row of this group was meaningful, somewhat of a second-header-row: the first row looked different in several columns than the rest of the rows. With that, I guessed that you might have a repeat of that meaningful row, so I chose to group_by the presence of non-numbers in the first column. If the rest of the date_index naturally has characters other than 0-9, then this logic will need to be revisited, or removed if my assumptions was incorrect.
    • I chose to convert the number-like fields to numbers; the type.convert can easily be removed or restricted if you need to (say) keep time as strings.

    Data

    quux <- structure(list(date_index = c("AL011851", "18510625", "18510625", "18510625", "18510625", "18510625"), time = c("            UNNAMED", " 0000", " 0600", " 1200", " 1800", " 2100"), n_entries = c("     14", "  ", "  ", "  ", "  ", " L"), type = c("", " HU", " HU", " HU", " HU", " HU"), wind = c(NA, 80L, 80L, 80L, 80L, 80L), Column5 = c("", " 28.0N", " 28.0N", " 28.0N", " 28.1N", " 28.2N"), Column6 = c("", "  94.8W", "  95.4W", "  96.0W", "  96.5W", "  96.8W")), row.names = c(NA, 6L), class = "data.frame")