readr has no reason to report fewer columns than expected


I'm trying to read the following CSV file from partyfacts with readr.

The import results in problems, but in reality there are no problems.

download.file("https://partyfacts.herokuapp.com/download/external-parties-csv/", "partyfacts-external-parties.csv")
df <- readr::read_csv("partyfacts-external-parties.csv", show_col_types = FALSE)

Warning: One or more parsing issues, call problems() on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)

Let's see what we have:

nrow(problems(df))

86

problems(df)[1,]

# A tibble: 1 × 5
    row   col expected   actual     file
  <int> <int> <chr>      <chr>      <chr>
1 35519    15 17 columns 15 columns /home/raffaele/Downloads/external-parties.csv

But in reality there are no problems.

Row 35519 is:

BIH,elecglob,292,SNSD,Alliance of Independent Social Democrats,Alliance of Independent Social Democrats,1998,2014,19.1,2006,,,2019-02-08 19:26:26.193233+00:00,2021-03-12 10:15:38.362019+00:00,30450,292,2019-02-08 19:26:26.296626+00:00

Which correctly contains 17 columns, not 15.

The other 85 problems are of the same nature (fewer columns read than expected), and the same reasoning applies: the number of columns in the source file is correct.

EDIT: I copied the line quoted above from a text editor. Apparently the editor's line numbers do not match the row numbers reported by R.


Solution

  • The file is huge, so it's hard to examine. A way to diagnose problems like this is to make the file smaller by deleting lines that are fine. I did that, and obtained this file, keeping only the first two lines, the first line that showed an error, and one line after that (which also shows an error):

    country,dataset_key,dataset_party_id,name_short,name,name_english,year_first,year_last,share,share_year,description,comment,created,modified,external_id,partyfacts_id,linked
    ALB,manifesto,75721,DBSH,E Djatha e Bashkuar e Shqipërisë,United Albanian Right,1996,1997,5.0,1996,,,2013-01-01 18:18:05.413000+00:00,2023-06-05 10:39:57.075788+00:00,1914,674,2013-01-01 18:33:17.889000+00:00
    BEN,gps,60,ABT,,Alliance pour un Benin triomphant,2011,2019,2.9,2015,,,2020-07-16 17:39:48.143406+00:00,2021-03-12 10:16:03.729055+00:00,47733
    BEN,gps,64,AE,,Eclaireur,2011,2019,3.7,2015,,,2020-07-16 17:39:57.563352+00:00,2021-03-12 10:16:03.731436+00:00,48035
    

    The third and fourth lines shown above were somewhere around line 35440 in the original file, and as you can see, they don't follow the same format as the previous line: the final two fields are missing.
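
    Short lines like these can also be located without trimming the file by hand. The following is a minimal sketch using base R's count.fields(); the four-column sample is made-up stand-in data (not the real partyfacts file), and it assumes no field contains an embedded newline.

    ```r
    # Hypothetical stand-in for the real file: a 4-column header, one complete
    # row, and two rows whose last two fields are missing.
    lines <- c(
      "country,name_short,external_id,partyfacts_id",
      "ALB,DBSH,1914,674",
      "BEN,ABT",
      "BEN,AE"
    )
    path <- tempfile(fileext = ".csv")
    writeLines(lines, path)

    # Count the fields on every line. Blank lines are kept and comment handling
    # is switched off so that positions match line numbers in the file.
    n_fields <- count.fields(path, sep = ",", quote = "\"",
                             blank.lines.skip = FALSE, comment.char = "")

    expected <- n_fields[1]                    # the header defines the width
    short_lines <- which(n_fields < expected)  # file line numbers, header = 1

    short_lines
    readLines(path)[short_lines]
    ```

    The positions returned are line numbers in the file with the header counted as line 1, so they need not match the row numbers that a parser reports.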

    read.csv() doesn't complain about this file, because it is documented to fill in missing fields with blanks unless you call it with fill = FALSE. When I do that I get an error.
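
The difference between the two read.csv() modes can be reproduced on a tiny file. This is a sketch with made-up stand-in data (not the real partyfacts file); the exact wording of the error message is not shown because it may vary between R versions.

```r
# Hypothetical stand-in data: the second data row lacks its last two fields.
lines <- c(
  "country,name_short,external_id,partyfacts_id",
  "ALB,DBSH,1914,674",
  "BEN,ABT"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# Default behaviour (fill = TRUE): the short row is padded silently, so the
# missing numeric fields come back as NA.
filled <- read.csv(path)
filled

# With fill = FALSE the short row is an error; capture its message
# instead of stopping the script.
strict <- tryCatch(
  read.csv(path, fill = FALSE),
  error = function(e) conditionMessage(e)
)
strict
```

Here `filled` has two rows with NA in the last two columns of the second row, while `strict` holds the error message rather than a data frame.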