Tags: r, tidyverse, apache-arrow

Is it possible to skip a paragraph using arrow::open_dataset in R?


I have 20 datasets, and some of them have an introduction in the first few rows. Not every dataset has an introduction, and the number of introduction rows differs between datasets, so skip_rows may not be useful. Is it possible to detect keywords and start reading from the row that contains them?

Sample dataset:

dataset 1:

balabala balabala...
A header Another header
First row
Second row

dataset 2:

A header Another header
First row
Second row

dataset 3:

balabala balabala...
balabala balabala...
A header Another header
First row
Second row

etc...

What I want:

dataset 1:

A header Another header
First row
Second row

dataset 2:

A header Another header
First row
Second row

dataset 3:

A header Another header
First row
Second row

etc...


Solution

  • You could read each dataset as-is and then drop everything above the row that matches the expected header:

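    library(dplyr)
    library(janitor)

    # df1: one introduction line above the real header
    df1 <- read.table(text = "balabala  balabala...
    'A header'  'Another header'
    First   row
    Second  row", header = TRUE)

    # df2: no introduction; check.names = FALSE keeps the spaces in the names
    df2 <- read.table(text = "'A header'    'Another header'
    First   row
    Second  row", header = TRUE, check.names = FALSE)

    # df3: two introduction lines above the real header
    df3 <- read.table(text = "balabala  balabala...
    balabala    balabala...
    'A header'  'Another header'
    First   row
    Second  row", header = TRUE)

    # The column names that mark the start of the real data
    header_vector <- c('A header', 'Another header')

    ftn <- function(df){
      if (identical(names(df), header_vector)) {
        # The header is already in place, nothing to skip
        df
      } else {
        # Flag the row whose values match the expected header
        df$key <- apply(df, 1, function(x) {all(x == header_vector)})
        df %>%
          # cumsum() turns the flag into "at or after the header row"
          mutate(key = cumsum(key)) %>%
          filter(key >= 1) %>% select(-key) %>%
          # Promote the header row to column names and drop it from the data
          janitor::row_to_names(row_number = 1)
      }
    }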
    
    ftn(df1)
    #>   A header Another header
    #> 2    First            row
    #> 3   Second            row

    ftn(df2)
    #>   A header Another header
    #> 1    First            row
    #> 2   Second            row

    ftn(df3)
    #>   A header Another header
    #> 2    First            row
    #> 3   Second            row
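
If the datasets are files on disk, another option is to find the header line before parsing, so the introduction never reaches the parser. The sketch below is only an illustration under some assumptions: `read_from_header` is a hypothetical helper (not part of arrow or any other package), the files are tab-delimited, and the keyword `A header` appears on the header line and nowhere above it. It uses only base R (`readLines`, `grepl`, `read.table`). The same line count could presumably be passed to any reader that accepts a skip argument, for example the `skip` argument of `arrow::read_csv_arrow`.

```r
# Hypothetical helper: read a delimited file starting at the first line
# that contains `keyword`, treating that line as the header.
read_from_header <- function(path, keyword, sep = "\t") {
  lines <- readLines(path, warn = FALSE)
  # Index of the first line containing the keyword (NA if there is none)
  header_line <- which(grepl(keyword, lines, fixed = TRUE))[1]
  if (is.na(header_line)) {
    stop("No line containing '", keyword, "' found in ", path)
  }
  # Skip every line above the header; zero lines if there is no introduction
  read.table(path, sep = sep, header = TRUE, skip = header_line - 1,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Two made-up example files: one with an introduction line, one without
with_intro <- tempfile(fileext = ".tsv")
writeLines(c("balabala\tbalabala...",
             "A header\tAnother header",
             "First\trow",
             "Second\trow"), with_intro)

without_intro <- tempfile(fileext = ".tsv")
writeLines(c("A header\tAnother header",
             "First\trow",
             "Second\trow"), without_intro)

# Apply the same helper to every file, whether or not it has an introduction
results <- lapply(c(with_intro, without_intro), read_from_header,
                  keyword = "A header")
print(results)
```

Because the skip count is computed per file, the same call covers files with no introduction and files with introductions of different lengths. If the keyword can also occur inside an introduction, a stricter match on the whole header line would be needed.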