Search code examples
rcsvread.table

R read.table function not working properly with tab separated data


Here are my data (it takes some SECONDS to download, please be patient):

library(dplyr)
mydata <- "https://pxdata.stat.fi:443/PxWeb/sq/87e44319-48f8-41b4-bd0d-a6629dc7829c" %>%
    paste0(".relational_table") %>% read.table(sep = "\t", header = T)

Now, some of the rows look as they should, e.g.

> head(mydata)
  Underlying.cause.of.death..ICD.10..3.character.level.   Age Year     Sex Information Deaths
1                                         A00-Y89 Total Total 2022   Total      Deaths  63172
2                                         A00-Y89 Total Total 2022   Males      Deaths  31703
3                                         A00-Y89 Total Total 2022 Females      Deaths  31469
4                                         A00-Y89 Total     0 2022   Total      Deaths     91
5                                         A00-Y89 Total     0 2022   Males      Deaths     52
6                                         A00-Y89 Total     0 2022 Females      Deaths     39

However, there are rows that do not look so goood:

> mydata %>% filter(grepl("\t",Underlying.cause.of.death..ICD.10..3.character.level.)) %>% head
                                          Underlying.cause.of.death..ICD.10..3.character.level.   Age Year     Sex Information Deaths
1   A30 Leprosy (Hansens disease)\tTotal\t2022\tTotal\tDeaths\t0\nA30 Leprosy (Hansens disease) Total 2022   Males      Deaths      0
2 A30 Leprosy (Hansens disease)\tTotal\t2022\tFemales\tDeaths\t0\nA30 Leprosy (Hansens disease)     0 2022   Total      Deaths      0
3       A30 Leprosy (Hansens disease)\t0\t2022\tMales\tDeaths\t0\nA30 Leprosy (Hansens disease)     0 2022 Females      Deaths      0
4   A30 Leprosy (Hansens disease)\t1 - 4\t2022\tTotal\tDeaths\t0\nA30 Leprosy (Hansens disease) 1 - 4 2022   Males      Deaths      0
5 A30 Leprosy (Hansens disease)\t1 - 4\t2022\tFemales\tDeaths\t0\nA30 Leprosy (Hansens disease) 5 - 9 2022   Total      Deaths      0
6   A30 Leprosy (Hansens disease)\t5 - 9\t2022\tMales\tDeaths\t0\nA30 Leprosy (Hansens disease) 5 - 9 2022 Females      Deaths      0

Any ideas, why this is happening? If read.table should use "\t" as a column separator, why on earth it is pasting the original rows as follows, and this happens only for some of the rows?

Is there a better function that could read these these data to a table properly?

(I am using Windows 10, if that could have anything to do with this issue.)


Solution

  • This is an issue with some single (or double) quotes as in Hansen's disease, which makes read.table believe that the whole row is in the same cell. Adding quote = "" as an argument to read.table has the effect of ignoring all quoting characters.

    mydata <- read.table(mydata, sep = "\t", header = T, quote = "")
    

    Raw data

    library(dplyr)
    mydata <- "https://pxdata.stat.fi:443/PxWeb/sq/87e44319-48f8-41b4-bd0d-a6629dc7829c" %>%
      paste0(".relational_table")