I have a several-gigabyte CSV file residing in Azure Data Lake. Using Dask, I can read this file in under a minute as follows:
>>> import dask.dataframe as dd
>>> adl_path = 'adl://...'
>>> df = dd.read_csv(adl_path, storage_options={...})
>>> len(df.compute())
However, I don't want to read this into a Dask or Pandas DataFrame -- I want direct access to the underlying file. (Currently it's CSV, but I'd also like to be able to handle Parquet files.) So I am also trying to use adlfs 0.2.0:
>>> import fsspec
>>> adl = fsspec.filesystem('adl', store_name='...', tenant_id=...)
>>> lines = 0
>>> with adl.open(adl_path) as fh:
...     for line in fh:
...         lines += 1
In the same amount of time the Dask job took, this method had read only 0.1% of the input.
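(For what it's worth, I believe adl.open also takes fsspec's usual block_size argument for tuning the read buffer; a variant like the one below is the obvious knob to try, though the 32 MiB value is just a guess on my part, not something I've benchmarked:)
>>> # Assuming adl.open supports fsspec's standard block_size keyword;
>>> # the 32 MiB value is arbitrary, not a measured recommendation.
>>> lines = 0
>>> with adl.open(adl_path, mode='rb', block_size=32 * 2**20) as fh:
...     for line in fh:
...         lines += 1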
I've tried using fsspec's caching, thinking that this would speed up access after the initial caching is done:
>>> fs = fsspec.filesystem("filecache", target_protocol='adl', target_options={...}, cache_storage='/tmp/files/')
>>> fs.exists(adl_path) # False
>>> fs.size(adl_path) # FileNotFoundError
>>> # Using a relative path instead of fully-qualified (FQ) path:
>>> abs_adl_path = 'absolute/path/to/my/file.csv'
>>> fs.exists(abs_adl_path) # True
>>> fs.size(abs_adl_path) # 1234567890 -- correct size in bytes
>>> fs.get(abs_adl_path, local_path) # FileNotFoundError
>>> handle = fs.open(abs_adl_path) # FileNotFoundError
Is there a performant way to read CSVs (and also Parquet) remotely as a normal Python filehandle without loading as a Dask DataFrame first?
I do not know why fs.get doesn't work, but please try this for the final line:
handle = fs.open(adl_path)
i.e., you open the original path, but you get a file handle to a local file (once the copy is done) somewhere in '/tmp/files/'.
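Putting it together, here is a minimal sketch of what I'd expect to work for both formats; the target_options, adl_path and parquet_path names are placeholders, and the Parquet part assumes pyarrow is installed:
import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem(
    "filecache",
    target_protocol="adl",
    target_options={...},        # your adlfs credentials (placeholder)
    cache_storage="/tmp/files/",
)

# CSV: the first open copies the file into /tmp/files/; the handle you
# get back reads from that local copy, so iteration is fast.
with fs.open(adl_path, mode="rt") as fh:
    for line in fh:
        ...  # process each line

# Parquet: the same kind of cached handle can be passed straight to pyarrow.
with fs.open(parquet_path, mode="rb") as fh:
    table = pq.read_table(fh)
The first open pays the full download cost; subsequent opens should be served from the local cache.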