I'm trying to read directly from a URL to grab a zip file that contains a pipe delimited text file. If I download the file, then use read_csv
to read it from disk, I have no problems. But if I try to use read_csv
to read the URL directly I get garbage in my resulting df. I can work around this by coding in a download then read. But it seems like it should work directly. Any clues on what's going on here?
library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
df <- read_delim(url, delim='|',
col_names = c('year','stFips','stAbbr','coFips','coName',
'cropCd','cropName','planCd','planAbbr','coverCat',
'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
'liab','prem','subsidy','indem', 'lossRatio'))
#> Parsed with column specification:
#> cols(
#> .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 7908 parsing failures.
#> row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 year "" embedded null 'https://www.rma.usda.gov/data/sob… file 2 1 <NA> 25 columns 1 columns 'https://www.rma.usda.gov/data/sob… row 3 2 <NA> 25 columns 4 columns 'https://www.rma.usda.gov/data/sob… col 4 3 <NA> 25 columns 2 columns 'https://www.rma.usda.gov/data/sob… expected 5 4 year "" embedded null 'https://www.rma.usda.gov/data/sob…
#> ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
#> See problems(...) for more details.
head(df)
#> # A tibble: 6 x 25
#> year stFips stAbbr coFips coName cropCd cropName planCd planAbbr
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "PK\u00… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 "K\xe6\… "\xf5\x… "\xc5\… "\xfa\… <NA> <NA> <NA> <NA> <NA>
#> 3 "\xb0\x… "\xfd\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 "j`/Q\x… "\x96\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 "\xc0\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 "z\xe4\… "~y\xf5… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # ... with 16 more variables: coverCat <chr>, deliveryType <chr>,
#> # covLevel <chr>, policyCount <chr>, policyPremCount <chr>,
#> # policyIndemCount <chr>, unitsReportingPrem <chr>, indemCount <chr>,
#> # quantType <chr>, quantNet <chr>, companionAcres <chr>, liab <chr>,
#> # prem <chr>, subsidy <chr>, indem <chr>, lossRatio <chr>
If I download first, I get the following output:
> url <- './data/sobcov_2018.zip'
> df <- read_delim(url, delim='|',
+ col_names = c('year','stFips','stAbbr','coFips','coName',
+ 'cropCd','cropName','planCd','planAbbr','coverCat',
+ 'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
+ 'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
+ 'liab','prem','subsidy','indem', 'lossRatio'))
Parsed with column specification:
cols(
.default = col_integer(),
stFips = col_character(),
stAbbr = col_character(),
coFips = col_character(),
coName = col_character(),
cropCd = col_character(),
cropName = col_character(),
planCd = col_character(),
planAbbr = col_character(),
coverCat = col_character(),
deliveryType = col_character(),
covLevel = col_double(),
quantType = col_character(),
lossRatio = col_double()
)
See spec(...) for full column specifications.
> head(df)
# A tibble: 6 x 25
year stFips stAbbr coFips coName cropCd cropName planCd planAbbr coverCat deliveryType covLevel
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2018 02 AK 999 "All Other … 9999 "All Other C… 01 "YP … "A " RBUP 0.500
2 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.500
3 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.750
4 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "C " RCAT 0.500
5 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.600
6 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.750
# ... with 13 more variables: policyCount <int>, policyPremCount <int>, policyIndemCount <int>,
# unitsReportingPrem <int>, indemCount <int>, quantType <chr>, quantNet <int>, companionAcres <int>,
# liab <int>, prem <int>, subsidy <int>, indem <int>, lossRatio <dbl>
>
readr
can handle only gz
compressed files as remote sources, since there are no analogues to base::gzcon()
for other compression algorithms. See this github issue for a discussion and the improved documentation (also in ?readr::datasource
).