r dataframe tidyverse readr

How to convert a space-delimited string to a data frame in R


I scraped this data from the OCC website and got back an ASCII file that is space delimited. I am looking to turn this string into a data frame.

I have tried read.table and readr::read_tsv, but I am not getting the desired results. Below is the code to access the data I am looking to convert.

  library(rvest)
  library(readr)

  data = read_html('https://www.theocc.com/webapps/series-search?symbolType=U&symbol=AAPL') %>% html_text()

  x = read.table(data, header = T) 
  x = read_tsv(data)   

I would have expected the result to come out as a data frame, but instead read.table() prints the result to the console along with an error and a warning message.


Solution

  • The downloaded file contains six lines of descriptive content above the header:

    Series Search Results for AAPL
    
    Products for this underlying symbol are traded on: 
    AMEX ARCA BATS BOX C2 CBOE EDGX GEM ISE MCRY MIAX MPRL NOBO NSDQ PHLX 
    
            Series/contract     Strike          Open Interest           
    ProductSymbol   year    Month   Day Integer Dec C/P Call    Put Position Limit  
    AAPL        2019    01  25  100 000 C P     0   190 25000000
    AAPL        2019    01  25  105 000 C P     0   127 25000000
    AAPL        2019    01  25  110 000 C P     0   87  25000000
    AAPL        2019    01  25  115 000 C P     0   314 25000000
    ...
    

    You can read it via read_tsv(skip = 6):

    library(rvest)
    library(readr)
    
    df <- read_html(
      'https://www.theocc.com/webapps/series-search?symbolType=U&symbol=AAPL'
    ) %>% 
      html_text() %>% 
      read_tsv(
        skip = 6
      )
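
    As for the error in the question: read.table() treats a character first argument as a file name, so passing the scraped string directly makes it try to open a file with that name. The `text` argument is the way to hand it literal content. Below is a minimal base R sketch of the same skip-six-lines idea; `txt` is a hypothetical, shortened stand-in for the scraped text, not the real OCC response.

    ```r
    # Hypothetical stand-in for the scraped text: six descriptive lines,
    # then a tab-separated header and two data rows.
    txt <- paste(
      "Series Search Results for AAPL",
      "",
      "Products for this underlying symbol are traded on:",
      "AMEX ARCA BATS",
      "",
      "\tSeries/contract\tStrike\tOpen Interest",
      "ProductSymbol\tyear\tMonth\tDay",
      "AAPL\t2019\t01\t25",
      "AAPL\t2019\t02\t01",
      sep = "\n"
    )

    # text = passes the string as content instead of a file name;
    # skip = 6 drops the descriptive lines above the header;
    # colClasses = "character" keeps leading zeros such as "01".
    df <- read.table(text = txt, skip = 6, header = TRUE, sep = "\t",
                     colClasses = "character")
    ```

    Reading everything as character is a choice made for this sketch only; convert columns afterwards if you need numbers.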
    

    However, the first column has a wide header, and two TABs separate it from the next column in the data rows. The result is an extra, empty column after the first one, so the remaining column names no longer line up with their values.

    You'll have to do some massaging:

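    library(dplyr)  # provides select()

    # keep the ten real column names, drop the empty column, then re-apply the names
    dfnames <- names(df)[1:10]
    df <- df %>% 
      select(-year)
    names(df) <- dfnames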
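
    If you would rather not depend on which column the empty values land in, here is an alternative sketch in base R (not the readr approach above): read the header line and the data rows separately, drop any column that is empty in every row, and then apply the header names. `txt` is again a hypothetical, shortened stand-in for the scraped text, and the sketch assumes that removing the empty columns leaves exactly as many columns as there are header names.

    ```r
    # Hypothetical stand-in for the scraped text (not the real OCC response).
    # Note the two TABs after the symbol in each data row.
    txt <- paste(
      "Series Search Results for AAPL",
      "",
      "Products for this underlying symbol are traded on:",
      "AMEX ARCA BATS",
      "",
      "\tSeries/contract\tStrike",
      "ProductSymbol\tyear\tMonth\tDay",
      "AAPL\t\t2019\t01\t25",
      "AAPL\t\t2019\t02\t01",
      sep = "\n"
    )

    # Line 7 holds the column names.
    lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
    col_names <- strsplit(lines[7], "\t", fixed = TRUE)[[1]]

    # Read only the data rows (skip the six descriptive lines plus the header).
    body <- read.table(text = txt, skip = 7, header = FALSE, sep = "\t",
                       colClasses = "character")

    # Drop columns that are empty in every row, then apply the header names.
    keep <- colSums(body != "") > 0
    body <- body[, keep, drop = FALSE]
    names(body) <- col_names
    ```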
    
