I have 100+ csv files in some "random" format where I need to read only selected parts related to Activities and Sleep. The problem is that across files there is no set row number where either of those sections starts as well as the length of each section varies. In some cases, the heading may not be present, or be present but no data
I search stackoverflow and while reading from a particular row is not a problem, in this case that fixed row number is not available.. How can I do this?
We could first parse the file with meltr
to identify starting and ending rows for each section. And then extracts (relevant) parts from the file.
library(dplyr)
library(tidyr)
library(readr)
library(meltr)
library(purrr)
library(stringr)
csv_ <- "https://raw.githubusercontent.com/maria-pro/tutorials/main/496.csv"
# get row/col/type for each token, empty rows are included as NA values
(melted <- melt_csv(csv_))
#> # A tibble: 358 × 4
#> row col data_type value
#> <dbl> <dbl> <chr> <chr>
#> 1 1 1 character Body
#> 2 2 1 character Date
#> 3 2 2 character Weight
#> 4 2 3 character BMI
#> 5 2 4 character Fat
#> 6 3 1 character 26-06-2022
#> 7 3 2 integer 73
#> 8 3 3 double 29.62
#> 9 3 4 integer 50
#> 10 4 1 character 27-06-2022
#> # ℹ 348 more rows
# Keep only rows with no more than 1 column (including empty rows);
# get all records before the first "Food Log" section;
# fill NAs in value, recode data_type to start/end, pivot wider;
# adjust start_row and end_row to match header and last non-emtpy line of
# the section
csv_sections <- melted %>%
filter(max(col) == 1, .by = row) %>%
# handle files without "Food Log" entries by appending a fake record
add_row(row = Inf, value = "Food Log end of file marker") %>%
filter(row < row[str_detect(value,"Food Log") %>% which.max()]) %>%
fill(value) %>%
mutate(data_type = case_match(data_type, "character" ~ "start_row", "missing" ~ "end_row" )) %>%
pivot_wider(names_from = data_type, values_from = row) %>%
mutate(start_row = start_row + 1,
end_row = end_row - 1) %>%
select(section = value, start_row, end_row)
csv_sections
#> # A tibble: 4 × 3
#> section start_row end_row
#> <chr> <dbl> <dbl>
#> 1 Body 2 9
#> 2 Foods 12 19
#> 3 Activities 22 29
#> 4 Sleep 32 37
# read CSV as as list of lines
csv_lines <- read_lines(csv_)
# cycle though csv_sections and parse subsets of csv_lines rows as CSV;
# I() makes read_csv() recognize the input as literal data
sections <- pmap(csv_sections,
\(section, start_row, end_row) read_csv(I(csv_lines[start_row:end_row]),
show_col_types = FALSE)) %>%
set_names(csv_sections$section)
Results:
sections$Activities
#> # A tibble: 7 × 10
#> Date `Calories Burned` Steps Distance Floors `Minutes Sedentary`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26-06-2022 1411 0 0 0 1440
#> 2 27-06-2022 1411 0 0 0 1440
#> 3 28-06-2022 1411 0 0 0 1440
#> 4 29-06-2022 2128 8292 5.38 27 1291
#> 5 30-06-2022 1678 1016 0.66 9 488
#> 6 01-07-2022 1634 1078 0.7 10 790
#> 7 02-07-2022 1537 732 0.47 7 846
#> # ℹ 4 more variables: `Minutes Lightly Active` <dbl>,
#> # `Minutes Fairly Active` <dbl>, `Minutes Very Active` <dbl>,
#> # `Activity Calories` <dbl>
sections$Sleep
#> # A tibble: 5 × 9
#> `Start Time` `End Time` `Minutes Asleep` `Minutes Awake`
#> <chr> <chr> <dbl> <dbl>
#> 1 01-07-2022 8:26 pm 02-07-2022 5:35 am 473 76
#> 2 01-07-2022 6:53 am 01-07-2022 9:48 am 155 20
#> 3 30-06-2022 8:58 pm 01-07-2022 3:38 am 341 59
#> 4 30-06-2022 11:45 am 30-06-2022 4:00 pm 225 30
#> 5 29-06-2022 10:03 pm 30-06-2022 8:19 am 524 92
#> # ℹ 5 more variables: `Number of Awakenings` <dbl>, `Time in Bed` <dbl>,
#> # `Minutes REM Sleep` <chr>, `Minutes Light Sleep` <chr>,
#> # `Minutes Deep Sleep` <chr>
Created on 2023-05-28 with reprex v2.0.2