Tags: gpu, dask, parquet, cudf

What is the relationship between BlazingSQL and Dask?


I'm trying to understand whether BlazingSQL is a competitor to Dask or complementary to it.

I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage.

IIUC I can query, join, aggregate, and group by with BlazingSQL using SQL syntax, but I can also read the data into cuDF using dask_cudf and do all the same operations using Python/dataframe syntax.
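
For concreteness, here's a minimal sketch of that dataframe route; the path, credentials, and column names are hypothetical placeholders, and reading from Azure blob storage assumes the adlfs/fsspec packages are installed:

```python
import dask_cudf

# Lazily read the parquet files into a partitioned GPU dataframe.
# The abfs:// path and credentials are hypothetical placeholders.
df = dask_cudf.read_parquet(
    "abfs://mycontainer/data/*.parquet",
    storage_options={"account_name": "myaccount", "account_key": "..."},
)

# The same kind of operation you would express in SQL:
totals = df.groupby("customer_id")["amount"].sum().compute()
```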

So, it seems to me that they're direct competitors?

Is it correct that (one of) the benefits of using Dask is that it operates on partitions, and so can handle datasets larger than GPU memory, whereas BlazingSQL is limited to what can fit on the GPU?

Why would one choose to use BlazingSQL rather than Dask?

Edit:
The docs talk about dask_cudf, but the actual repo is archived with a note saying that Dask support now lives in cudf itself. It would be good to know how to leverage Dask to operate on larger-than-GPU-memory datasets with cudf.
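
For what it's worth, a minimal sketch of one current route (assuming recent RAPIDS/Dask versions): set Dask's dataframe backend to cudf and pair it with dask-cuda, whose device_memory_limit lets each worker spill partitions from GPU to host memory. The cluster settings, path, and column names below are illustrative:

```python
import dask
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Back Dask dataframes with cudf (requires dask_cudf to be installed).
dask.config.set({"dataframe.backend": "cudf"})

# A one-machine CUDA cluster; device_memory_limit lets each worker
# spill partitions from GPU memory to host memory, so the overall
# dataset can be larger than what fits on the GPU.
cluster = LocalCUDACluster(device_memory_limit="10GB")
client = Client(cluster)

# Hypothetical path; each partition is a cudf DataFrame on the GPU.
df = dd.read_parquet("data/*.parquet")
print(df.groupby("key")["value"].mean().compute())
```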


Solution

  • Full disclosure: I'm a co-founder of BlazingSQL.

    BlazingSQL and Dask are not competitive; in fact, you need Dask to use BlazingSQL in a distributed context. All distributed BlazingSQL results return dask_cudf result sets, so you can continue operating on those results in Python/dataframe syntax (see the sketch after the list below). To your point, you are correct on two counts:

    1. BlazingSQL is currently limited to GPU memory, plus some system memory by leveraging CUDA's Unified Virtual Memory. That will change soon; we are estimating around v0.13, which is scheduled for an early March release. Upon that release, memory will spill over and cache to system memory, local drives, or even our supported storage plugins such as AWS S3, Google Cloud Storage, and HDFS.
    2. You can totally write SQL operations as dask_cudf functions, but it is incumbent on the user to know all of those functions and to optimize their usage. SQL has a variety of benefits: it is more accessible (more people know it, and it's very easy to learn), and there is a great deal of research around optimizing SQL for running queries at scale (cost-based optimizers, for example).
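
    To make the interop concrete, here is a minimal single-GPU sketch using BlazingContext; the table, column, and path names are made up for illustration, and in a distributed context (pass dask_client=... to BlazingContext) the result would be a dask_cudf DataFrame instead:

    ```python
    from blazingsql import BlazingContext

    # Single-GPU context; pass dask_client=... to run distributed.
    bc = BlazingContext()

    # Register a table straight from parquet (path is a placeholder).
    bc.create_table("orders", "data/orders/*.parquet")

    # Run SQL, then keep working on the result in dataframe syntax.
    result = bc.sql(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM orders GROUP BY customer_id"
    )
    top10 = result.nlargest(10, "total")
    ```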

    If you wish to make RAPIDS accessible to more users, SQL is a pretty easy onboarding path, and it's much easier to optimize for: the scope needed to optimize SQL operations is far smaller than for Dask, which has many other considerations.