I'm having an issue reading Parquet data from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found some solutions to read it, but it takes almost 1 hour.
Below is the script that works, but it's too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem
# connect to the SFTP server through fsspec
fs = SFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
# the remote path must be a string; read_table returns a pyarrow.Table
df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL, since read.parquet expects a path string rather than a Python file object.
df = sqlContext.read.parquet(sftp.open('PATH/file'))
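One workaround that would keep Spark's parallel Parquet reader is to stage the file somewhere Spark can reach and read the local copy. A minimal sketch, reusing the fsspec filesystem from above; the /tmp staging path is just an example:

# download the remote file in one transfer, then let Spark read it normally
local_path = '/tmp/FILE.parquet'  # hypothetical staging location
fs.get(f'{SERVER_LOCATION}/FILE.parquet', local_path)
df = sqlContext.read.parquet(local_path)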
I also tried to use the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
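For reference, the pattern from that article looks roughly like this, assuming the com.springml:spark-sftp package is on the Spark classpath (this is the part that failed for me):

# spark-sftp fetches the remote file and exposes it as a DataFrame
df = sqlContext.read.format('com.springml.spark.sftp') \
    .option('host', SERVER_SFTP) \
    .option('username', USER) \
    .option('password', PWD) \
    .option('fileType', 'parquet') \
    .load(f'{SERVER_LOCATION}/FILE.parquet')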
By adding the buffer_size parameter to pyarrow.parquet.read_table, the computation time went from 51 to 21 minutes :)

# buffered reads cut the number of small round trips over SFTP
df = pq.read_table(f'{SERVER_LOCATION}/FILE.parquet', filesystem=fs, buffer_size=32768)
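Note that read_table returns a pyarrow.Table, not a DataFrame. If a Spark DataFrame is what's needed afterwards, one option (a sketch that goes through pandas on the driver, so it assumes the table fits in driver memory) is:

pdf = df.to_pandas()  # pyarrow.Table -> pandas DataFrame
sdf = sqlContext.createDataFrame(pdf)  # pandas -> Spark DataFrame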
Thanks @Martin Prikryl for your help ;)