Tags: python, pyspark, sftp, parquet

Reading large Parquet file from SFTP with PySpark is slow


I'm having an issue reading Parquet data from an SFTP server with SQLContext.

The Parquet file is quite large (6M rows).
I found some solutions to read it, but it takes almost an hour..

Below is the script that works, but it is too slow.

import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem

fs = SFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)

# Read the remote Parquet file into an Arrow table over SFTP
df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs)

When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use Spark SQL to read a remote file over SFTP?

df = sqlContext.read.parquet('PATH/file')

Things that I tried: opening the file with an SFTP library, but that seems to lose all the advantages of Spark SQL.

df = sqlContext.read.parquet(sftp.open('PATH/file'))

I also tried the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
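
The only other option I can think of is to copy the file to local storage first and then read the local copy with Spark. A minimal sketch, reusing the SFTPFileSystem from above and assuming there is enough local disk space (the local path /tmp/FILE.parquet is just an example):

# Copy the remote Parquet file to local disk over SFTP, then read the local copy with Spark
fs.get(f'{SERVER_LOCATION}/FILE.parquet', '/tmp/FILE.parquet')
df = sqlContext.read.parquet('/tmp/FILE.parquet')

But I was hoping to avoid the extra copy.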


Solution

  • By adding the buffer_size parameter to the pyarrow.parquet read, the run time went from 51 to 21 minutes :)

    df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs, buffer_size=32768)
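
    Note that pq.read_table returns a pyarrow Table, not a Spark DataFrame. If a Spark DataFrame is still needed afterwards, one way to get there is via pandas; a minimal sketch, assuming the 6M rows fit in driver memory:

    # 'df' above is a pyarrow.Table; convert it via pandas to obtain a Spark DataFrame
    # (this materializes all rows on the driver)
    spark_df = sqlContext.createDataFrame(df.to_pandas())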
    

    Thanks @Martin Prikryl for your help ;)