Tags: sql-server, performance, pandas, streaming, parquet

What is the fastest way to extract 1 terabyte of data from SQL Server tables to Parquet files without Hadoop?


I need to extract 2 tables from a SQL Server database to Apache Parquet files (I don't use Hadoop, only Parquet files). The options I know of are:

  1. Load the data into a Pandas dataframe and save it to a Parquet file. But this method doesn't stream the data from SQL Server to Parquet, and I only have 6 GB of RAM (see the sketch after this list).

  2. Use TurboODBC to query SQL Server, convert the data to Apache Arrow on the fly, and then convert it to Parquet. Same problem as above: TurboODBC doesn't currently stream.
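
A minimal sketch of the in-memory approach from option 1, assuming a SQLAlchemy/pyodbc connection (the connection string and table name below are hypothetical). It illustrates the problem: pandas.read_sql materialises the entire result set in RAM before to_parquet writes anything.

```python
import pandas as pd
import sqlalchemy

# Hypothetical connection string -- adjust driver, host, and credentials.
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@my_server/my_db?driver=ODBC+Driver+17+for+SQL+Server"
)

# read_sql pulls the whole result set into a single in-memory DataFrame,
# which is not feasible for a ~1 TB table on a machine with 6 GB of RAM.
df = pd.read_sql("SELECT * FROM dbo.big_table", engine)
df.to_parquet("big_table.parquet")  # needs pyarrow or fastparquet installed
```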

Does a tool or library exist that can easily and "quickly" extract 1 TB of data from SQL Server tables to Parquet files?


Solution

  • The missing functionality you are looking for is retrieving the result in batches with Apache Arrow in Turbodbc instead of the whole table at once: https://github.com/blue-yonder/turbodbc/issues/133 You can either help with the implementation of this feature or use fetchnumpybatches in the meantime to retrieve the result in a chunked fashion (see the first sketch below).

    In general, I would recommend that you export the data not as one big Parquet file but as many smaller ones; this will make working with them much easier. Almost all engines/tools that can consume Parquet can handle multiple files as one big dataset. You can then also split your query into multiple ones that write out the Parquet files in parallel. If you limit each export to a chunk that is smaller than your total main memory, you should also be able to use fetchallarrow to write to Parquet in one go (see the second sketch below).
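
A minimal sketch of the chunked fetchnumpybatches approach, assuming a Turbodbc DSN named my_sql_server_dsn and a table dbo.big_table (both hypothetical). Each batch of NumPy masked arrays is converted to an Arrow table and appended to a single Parquet file via pyarrow.parquet.ParquetWriter, so only one batch is resident in memory at a time.

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from turbodbc import connect

# Hypothetical DSN and table -- replace with your own connection details.
connection = connect(dsn="my_sql_server_dsn")
cursor = connection.cursor()
cursor.execute("SELECT * FROM dbo.big_table")

writer = None
try:
    # fetchnumpybatches() yields one dict of NumPy masked arrays per
    # read buffer, so only a single batch is held in memory at a time.
    for batch in cursor.fetchnumpybatches():
        arrays = [
            pa.array(np.asarray(column), mask=np.ma.getmaskarray(column))
            for column in batch.values()
        ]
        table = pa.Table.from_arrays(arrays, names=list(batch.keys()))
        if writer is None:
            # Create the writer lazily, using the schema of the first batch.
            writer = pq.ParquetWriter("big_table.parquet", table.schema)
        writer.write_table(table)
finally:
    if writer is not None:
        writer.close()
    cursor.close()
    connection.close()
```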
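
And a sketch of the split-and-parallelise suggestion, assuming the table has a numeric key column id that can be used to partition the export (the DSN, table, column, and range bounds are all hypothetical). Each worker runs its own query, fetches the chunk with fetchallarrow, and writes its own Parquet file; downstream readers can then treat the resulting files as one dataset.

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq
from turbodbc import connect

CHUNK = 10_000_000  # hypothetical width of each id range

def export_chunk(chunk_id, lower, upper):
    # Hypothetical DSN, table, and partitioning column.
    connection = connect(dsn="my_sql_server_dsn")
    cursor = connection.cursor()
    cursor.execute(
        "SELECT * FROM dbo.big_table WHERE id >= ? AND id < ?", [lower, upper]
    )
    # fetchallarrow() materialises the whole chunk in memory,
    # so keep each id range well below the available RAM.
    table = cursor.fetchallarrow()
    pq.write_table(table, f"big_table_part_{chunk_id:04d}.parquet")
    cursor.close()
    connection.close()

bounds = range(0, 1_000_000_000, CHUNK)  # hypothetical key space
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(export_chunk, i, lower, lower + CHUNK)
        for i, lower in enumerate(bounds)
    ]
    for future in futures:
        future.result()  # surface any export errors
```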