Tags: r, apache-arrow, dbplyr, duckdb

How to connect to a remote csv file with duckdb or arrow in R?


Goal

Connect to a large remote csv file to query a subset of the data.

This page has a button to download a csv file. If I copy the link and run the following, the data is loaded into memory:

foo <- data.table::fread("https://datahub.transportation.gov/api/views/8ect-6jqj/rows.csv?date=20231118&accessType=DOWNLOAD")

Trying duckdb via dbplyr:

library(DBI)
library(dbplyr)
library(tidyverse)
library(arrow)
library(duckdb)

(con <- DBI::dbConnect(duckdb::duckdb(), 
                  dbdir = ":memory:",
                  "https://datahub.transportation.gov/api/views/8ect-6jqj/rows.csv?date=20231118&accessType=DOWNLOAD"))
# <duckdb_connection f96e0 driver=<duckdb_driver fd1f0 dbdir=':memory:' read_only=FALSE bigint=numeric>>

But this results in:

> dbListTables(con)
character(0)

Trying arrow:

 > open_dataset(sources = "https://datahub.transportation.gov/api/views/8ect-6jqj/rows.csv?date=20231118&accessType=DOWNLOAD")
Error: Invalid: Unrecognized filesystem type in URI: https://datahub.transportation.gov/api/views/8ect-6jqj/rows.csv?date=20231118&accessType=DOWNLOAD

Is this because of some kind of restriction by the provider or am I using the packages wrong?


Solution

  • No tool that I'm aware of is going to give you lazy/efficient access to data stored as CSV on a remote HTTP(S) server.

    If you want lazy access to that file (which is quite big), then you still need to download it first, and after that you have several options. Two of them:

    1. arrow::read_csv_arrow("path/to/file.csv", as_data_frame = FALSE), which provides a dplyr-friendly connection that allows "lazy" filtering, mutate/summarize, and several other verbs (see the first sketch after this list).

      Because of how it operates, it only pulls into R's memory the data that passes all of your filters (it relies on dplyr, with either base `|>` or magrittr's `%>%`, for this functionality). This means that if you follow it immediately with %>% collect(), without any filtering, it will load the entire CSV into R's memory, which seems to be not what you want.

    2. duckdb::duckdb_read_csv(con, "some_table", files = "path/to/file.csv"), which will create (or append to) a table in an existing duckdb database (see the second sketch after this list).

      Note: if you are using ":memory:", then this will still load the entire CSV file into the system's RAM, though with the small distinction that the data is stored using duckdb's representation, not R's. Whether that helps or makes no practical difference, I have not checked.

      If you go this route, I recommend you use an on-disk duckdb file so that you don't load the entire file into memory. If you're familiar with SQL, this may be preferred.
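    As a rough sketch of option 1 (the URL is the one from the question; the column name in the filter is a placeholder, since I don't know the file's actual columns):

      library(dplyr)
      library(arrow)

      # download the file once (URL from the question)
      url <- "https://datahub.transportation.gov/api/views/8ect-6jqj/rows.csv?date=20231118&accessType=DOWNLOAD"
      download.file(url, "rows.csv", mode = "wb")

      # open as an Arrow Table instead of an R data frame
      dat <- arrow::read_csv_arrow("rows.csv", as_data_frame = FALSE)

      # dplyr verbs are deferred; only the result of the query is
      # materialized as an R data frame when you collect()
      sub_df <- dat %>%
        filter(SOME_COLUMN == "some value") %>%   # placeholder column
        collect()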
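    And a sketch of option 2, using an on-disk duckdb file as recommended above (again assuming the CSV was already downloaded to "rows.csv"; the table and column names are placeholders):

      library(DBI)
      library(duckdb)
      library(dplyr)
      library(dbplyr)

      # on-disk database, so the table does not have to live in RAM
      con <- dbConnect(duckdb::duckdb(), dbdir = "transport.duckdb")

      # create (or append to) a table from the downloaded CSV
      duckdb::duckdb_read_csv(con, "some_table", files = "rows.csv")

      # query it with SQL ...
      dbGetQuery(con, "SELECT COUNT(*) AS n FROM some_table")

      # ... or lazily with dplyr/dbplyr
      tbl(con, "some_table") %>%
        filter(SOME_COLUMN == "some value") %>%   # placeholder column
        collect()

      dbDisconnect(con, shutdown = TRUE)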

    They both have their advantages. If you are already familiar with SQL and want to capitalize on that, then perhaps the duckdb route is preferred. If you're already familiar with dplyr, then perhaps the arrow "lazy" approach is the better fit. If neither SQL nor dplyr is already in your personal toolkit, then in general I think many would recommend the dplyr-based tools, but you would do well to take inventory of your future demands, hopes, and expectations to see whether SQL will be coming your way anyway. (Yes, there's also dbplyr, which helps mask the SQL ... again, subjective. Over to you.)