read_tsv in readr not parsing table correctly


I'm trying to read in a tab-separated table, but it keeps producing parsing failures. I think the cause is the unescaped double quotes in the text. See below for an example:

concept_id  concept_name    domain_id   vocabulary_id   concept_class_id    standard_concept    concept_code    valid_start_date    valid_end_date  invalid_reason
2618087 Services delivered under an outpatient speech language pathology plan of care   Observation HCPCS   HCPCS Modifier  S   GN  19990101    20991231
2618083 "opt out" physician or practitioner emergency or urgent service Observation HCPCS   HCPCS Modifier  S   GJ  19981001    20991231
2618082 Diagnostic mammogram converted from screening mammogram on same day Observation HCPCS   HCPCS Modifier  S   GH  19981001    20991231

Note the "opt out" in the second column, where the problem seems to originate. The following code has parsing failures:

library(readr)

df <- read_delim(
  file = "~/_data/test.csv",
  col_types = cols(
    col_integer(), col_character(), col_character(),
    col_character(), col_character(), col_character(),
    col_character(), col_date(format = "%Y%m%d"), col_date(format = "%Y%m%d"),
    col_character()),
  delim = "\t"
  )

Warning: 4 parsing failures.
row          col                     expected    actual               file
  1 NA           10 columns                   9 columns '~/_data/test.csv'
  2 concept_name delimiter or quote                     '~/_data/test.csv'
  2 concept_name closing quote at end of file           '~/_data/test.csv'
  2 NA           10 columns                   2 columns '~/_data/test.csv'

I can't seem to find a solution.


Solution

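  • Setting quote = "" resolves the issue. It disables quote handling, so the double quotes around "opt out" are read as literal characters instead of as the start of a quoted field: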

    df <- read_delim(
      file = "~/_data/test.csv",
      col_types = cols(
        col_integer(), col_character(), col_character(),
        col_character(), col_character(), col_character(),
        col_character(), col_date(format = "%Y%m%d"), col_date(format = "%Y%m%d"),
        col_character()),
      quote = "",
      delim = "\t"
      )
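
To check the fix without the original file, here is a minimal sketch that rebuilds the three sample rows in a temporary file and reads them with quoting disabled. The temporary file, the helper names (header, rows, tmp), and the empty invalid_reason field are assumptions made for illustration; read_tsv is used as the tab-delimited shorthand for read_delim, and the column types are given by name rather than by position.

```r
library(readr)

# Column names taken from the sample table in the question.
header <- c("concept_id", "concept_name", "domain_id", "vocabulary_id",
            "concept_class_id", "standard_concept", "concept_code",
            "valid_start_date", "valid_end_date", "invalid_reason")

# The three sample rows. The last field (invalid_reason) is assumed to be empty.
rows <- list(
  c("2618087",
    "Services delivered under an outpatient speech language pathology plan of care",
    "Observation", "HCPCS", "HCPCS Modifier", "S", "GN", "19990101", "20991231", ""),
  c("2618083",
    "\"opt out\" physician or practitioner emergency or urgent service",
    "Observation", "HCPCS", "HCPCS Modifier", "S", "GJ", "19981001", "20991231", ""),
  c("2618082",
    "Diagnostic mammogram converted from screening mammogram on same day",
    "Observation", "HCPCS", "HCPCS Modifier", "S", "GH", "19981001", "20991231", "")
)

# Write a tab-separated file to a temporary location (hypothetical path).
tmp <- tempfile(fileext = ".tsv")
writeLines(
  c(paste(header, collapse = "\t"),
    vapply(rows, paste, character(1), collapse = "\t")),
  tmp
)

# quote = "" turns off quote handling, so the double quotes stay in the text.
df <- read_tsv(
  tmp,
  col_types = cols(
    concept_id       = col_integer(),
    valid_start_date = col_date(format = "%Y%m%d"),
    valid_end_date   = col_date(format = "%Y%m%d"),
    .default         = col_character()
  ),
  quote = ""
)

# Inspect any remaining parsing failures and the previously problematic value.
problems(df)
df$concept_name[2]
```

Naming the columns in cols() and using .default = col_character() keeps the specification short and avoids counting positions. If the real file also contains single quotes or other special characters, problems(df) is the place to look first.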