Search code examples
rread.csv

importing part of .csv from selected rows identified by headings


I have 100+ csv files in some "random" format where I need to read only selected parts related to Activities and Sleep. The problem is that across files there is no set row number where either of those sections starts as well as the length of each section varies. In some cases, the heading may not be present, or be present but no data

To get the idea: enter image description here

I search stackoverflow and while reading from a particular row is not a problem, in this case that fixed row number is not available.. How can I do this?


Solution

  • We could first parse the file with meltr to identify starting and ending rows for each section. And then extracts (relevant) parts from the file.

    library(dplyr)
    library(tidyr)
    library(readr)
    library(meltr)
    library(purrr)
    library(stringr)
    
    csv_ <- "https://raw.githubusercontent.com/maria-pro/tutorials/main/496.csv"
    
    # get row/col/type for each token, empty rows are included as NA values
    (melted <-  melt_csv(csv_))
    #> # A tibble: 358 × 4
    #>      row   col data_type value     
    #>    <dbl> <dbl> <chr>     <chr>     
    #>  1     1     1 character Body      
    #>  2     2     1 character Date      
    #>  3     2     2 character Weight    
    #>  4     2     3 character BMI       
    #>  5     2     4 character Fat       
    #>  6     3     1 character 26-06-2022
    #>  7     3     2 integer   73        
    #>  8     3     3 double    29.62     
    #>  9     3     4 integer   50        
    #> 10     4     1 character 27-06-2022
    #> # ℹ 348 more rows
    
    # Keep only rows with no more than 1 column (including empty rows);
    # get all records before the first "Food Log" section;
    # fill NAs in value, recode data_type to start/end, pivot wider; 
    # adjust start_row and end_row to match header and last non-emtpy line of 
    # the section
    csv_sections <- melted %>% 
      filter(max(col) == 1, .by = row) %>% 
      # handle files without "Food Log" entries by appending a fake record
      add_row(row = Inf, value = "Food Log end of file marker") %>% 
      filter(row < row[str_detect(value,"Food Log") %>% which.max()]) %>% 
      fill(value) %>% 
      mutate(data_type = case_match(data_type, "character" ~ "start_row", "missing" ~ "end_row" )) %>% 
      pivot_wider(names_from = data_type, values_from = row) %>% 
      mutate(start_row = start_row + 1,
               end_row =   end_row - 1) %>% 
      select(section = value, start_row, end_row)
    
    csv_sections
    #> # A tibble: 4 × 3
    #>   section    start_row end_row
    #>   <chr>          <dbl>   <dbl>
    #> 1 Body               2       9
    #> 2 Foods             12      19
    #> 3 Activities        22      29
    #> 4 Sleep             32      37
    
    # read CSV as as list of lines
    csv_lines <- read_lines(csv_)
    # cycle though csv_sections and parse subsets of csv_lines rows as CSV;
    # I() makes read_csv() recognize the input as literal data
    sections <- pmap(csv_sections, 
                     \(section, start_row, end_row) read_csv(I(csv_lines[start_row:end_row]), 
                                                             show_col_types = FALSE)) %>% 
      set_names(csv_sections$section)
    

    Results:

    sections$Activities
    #> # A tibble: 7 × 10
    #>   Date       `Calories Burned` Steps Distance Floors `Minutes Sedentary`
    #>   <chr>                  <dbl> <dbl>    <dbl>  <dbl>               <dbl>
    #> 1 26-06-2022              1411     0     0         0                1440
    #> 2 27-06-2022              1411     0     0         0                1440
    #> 3 28-06-2022              1411     0     0         0                1440
    #> 4 29-06-2022              2128  8292     5.38     27                1291
    #> 5 30-06-2022              1678  1016     0.66      9                 488
    #> 6 01-07-2022              1634  1078     0.7      10                 790
    #> 7 02-07-2022              1537   732     0.47      7                 846
    #> # ℹ 4 more variables: `Minutes Lightly Active` <dbl>,
    #> #   `Minutes Fairly Active` <dbl>, `Minutes Very Active` <dbl>,
    #> #   `Activity Calories` <dbl>
    sections$Sleep
    #> # A tibble: 5 × 9
    #>   `Start Time`        `End Time`         `Minutes Asleep` `Minutes Awake`
    #>   <chr>               <chr>                         <dbl>           <dbl>
    #> 1 01-07-2022 8:26 pm  02-07-2022 5:35 am              473              76
    #> 2 01-07-2022 6:53 am  01-07-2022 9:48 am              155              20
    #> 3 30-06-2022 8:58 pm  01-07-2022 3:38 am              341              59
    #> 4 30-06-2022 11:45 am 30-06-2022 4:00 pm              225              30
    #> 5 29-06-2022 10:03 pm 30-06-2022 8:19 am              524              92
    #> # ℹ 5 more variables: `Number of Awakenings` <dbl>, `Time in Bed` <dbl>,
    #> #   `Minutes REM Sleep` <chr>, `Minutes Light Sleep` <chr>,
    #> #   `Minutes Deep Sleep` <chr>
    

    Created on 2023-05-28 with reprex v2.0.2