Search code examples
pythonpandascsvftpftplib

Read CSV files to Pandas dataframe from FTP with Python ftplib


I am getting a "[Errno 2] No such file or directory" error when trying to download files from an FTP server to a Pandas dataframe. The files are in the root directory of the FTP.

I am guessing that the pd.read_csv() function is looking at my local file system... i.e. at the local path were the script resides.... But I do not understand how to change this.

def fetch_files(site, username, password, directory: str = '/', filematch: str = '*.csv'):
    with ftplib.FTP(site) as ftp:
        # pass the url without protocol
        ftp = ftplib.FTP(site)
        # pass credentials if anonymous access is not allowed
        ftp.login(username, password)
        ftp.cwd(directory)
        list_ = []
        for file_ in ftp.nlst(filematch):
            print(file_) # This works
            df = pd.read_csv(file_, index_col=None, header=0) # This fails
            list_.append(df)

Or would I have to use the ftp.retrlines() method? If so what is the difference between the LIST and MLSD parameter?

On a side note: The files in the CSVs have HTML code in them like & which screws out the SQL bulk insert. That's the reason I am reading them to a dataframe is to change the encoding and merge the individual files. Is there a faster way to do this directly via the Python csv module? I guess this would be faster?

Thank you in advance


Solution

  • Use FTP.retrbinary and BytesIO to download the file to memory and then pass the in-memory file-like object to read_csv:

    flo = BytesIO()
    ftp.retrbinary('RETR ' + file_, flo.write)
    flo.seek(0)
    pd.read_csv(flo, ...)
    

    Similar question: Reading files from FTP server to DataFrame in Python


    The above loads whole CSV file to a memory and only then it will parse it. If you want to parse the file as it downloads, it that would probably require implementing a smart custom file-like object. What is not easy.

    For a question that does something similar, see my answer to:
    Get files names inside a zip file on FTP server without downloading whole archive.


    Though, pandas.read_csv documentation claims that it supports FTP directly.
    So this should do too:

    pd.read_csv("ftp://username:[email protected]/remote/path/" + file_)