pycurl and MLST


For several reasons, we would like to use pycurl to get information about a file stored on an FTP server using the MLST command.

We get almost what we need with the following code:

# More or less equivalent to: curl --list-only -X MLST -D /tmp/headers ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt
import pycurl
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO
c = pycurl.Curl()
c.setopt(c.URL, r'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt')
c.setopt(pycurl.DIRLISTONLY, True)
# Use MLST
c.setopt(c.CUSTOMREQUEST, "MLST")
# Write header to buffer
output = BytesIO()
c.setopt(pycurl.HEADERFUNCTION, output.write)
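# Perform request; perform() raises pycurl.error here, but the
# control-connection replies are still captured in the buffer
try:
    c.perform()
except pycurl.error as exc:
    print(exc)
# Print header
result = output.getvalue()
result = result.decode('ISO-8859-1')
print(result)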

perform() fails with CURLE_FTP_COULDNT_RETR_FILE, but result (the headers) contains what we need. The CLI version behaves the same way: the return code is also CURLE_FTP_COULDNT_RETR_FILE, but the file /tmp/headers contains the data.

We think that this is related to the fact that MLST uses the control connection and not the data connection.

Any ideas?

EDIT 1

We haven't found a way to get the result without DIRLISTONLY (which is weird). Also, if we use NOBODY, we don't get the answer.

EDIT 2

It turns out that result contains the information about the directory (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/), not the file, so the code here is incorrect.
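For reference, RFC 3659 specifies that an MLST reply arrives on the control connection as a multi-line 250 response whose entry line starts with a single space, followed by `name=value;` facts, a space, and the pathname. The sketch below pulls the facts and the pathname out of captured header text, which makes it easy to check whether the reply describes the file or its directory. The function name `parse_mlst_reply` and the sample reply (path and values) are made up for illustration and were not returned by the NCBI server.

```python
def parse_mlst_reply(header_text):
    """Return (facts, pathname) for the first MLST entry found, or None.

    Minimal sketch: it only looks inside multi-line 250 replies and
    expects at least one fact on the entry line.
    """
    in_reply = False
    for line in header_text.splitlines():
        if line.startswith("250-"):
            in_reply = True
        elif line.startswith("250 "):
            in_reply = False
        elif in_reply and line.startswith(" "):
            # Drop the leading space; facts end at the first space,
            # and everything after it is the pathname.
            facts_part, _, pathname = line[1:].partition(" ")
            if not facts_part.endswith(";"):
                continue  # not an entry line with facts
            facts = {}
            for fact in facts_part.split(";"):
                if fact:
                    name, _, value = fact.partition("=")
                    # Fact names are case-insensitive.
                    facts[name.lower()] = value
            return facts, pathname
    return None


# Hypothetical control-connection transcript, for illustration only.
sample = (
    "230 Login successful.\r\n"
    "250-Start of list\r\n"
    " Type=dir;modify=20200101000000;perm=el; /pub/example dir/\r\n"
    "250 End of list\r\n"
)

facts, pathname = parse_mlst_reply(sample)
print(pathname, facts)
```

A `type` fact of `dir`, or a pathname that ends at the directory, is a quick sign that the reply does not describe the requested file.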


Solution

  • It turns out that this is hard (if not impossible) to do (see EDIT 2). However, a few lines of code are enough to get the most important information (file size and last modification time).

    The code is based on the getinfo method (and the OPT_FILETIME option):

    import datetime
    import pycurl
    c = pycurl.Curl()
    c.setopt(c.URL, r'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt')
    # Do not download the file body
    c.setopt(pycurl.NOBODY, True)
    # Ask libcurl to fetch the remote modification time
    c.setopt(pycurl.OPT_FILETIME, True)
    # Perform request
    c.perform()
    # Print info: last modification time (shown in local time) and size in bytes
    timestamp = c.getinfo(pycurl.INFO_FILETIME)
    print(datetime.datetime.fromtimestamp(timestamp))
    print(c.getinfo(pycurl.CONTENT_LENGTH_DOWNLOAD))
    c.close()
    

    Of course, we use NOBODY to avoid downloading the file.

    This is more or less equivalent to the command:

    $ curl --head ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt
    Last-Modified: Thu, 07 Nov 2019 11:58:21 GMT
    Content-Length: 1207490
    Accept-ranges: bytes
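
One detail worth handling: libcurl reports a file time of -1 when the server does not provide it, and `datetime.fromtimestamp` without a time zone prints local time, whereas `curl --head` shows GMT. A minimal sketch of a conversion helper, assuming Python 3 (the name `filetime_to_utc` is hypothetical, not part of pycurl):

```python
from datetime import datetime, timezone


def filetime_to_utc(filetime):
    """Convert an INFO_FILETIME value to an aware UTC datetime.

    Returns None when the value is negative, which libcurl uses to
    signal that the modification time is unknown.
    """
    if filetime is None or filetime < 0:
        return None
    return datetime.fromtimestamp(filetime, tz=timezone.utc)


# Illustration with a locally built timestamp rather than a live request.
example = datetime(2019, 11, 7, 11, 58, 21, tzinfo=timezone.utc).timestamp()
print(filetime_to_utc(example))
print(filetime_to_utc(-1))
```

With the code above, it would be called as `filetime_to_utc(c.getinfo(pycurl.INFO_FILETIME))`.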