Search code examples
pythonexcelpandasbytesio

How can I read a .xlsx file as a pandas DataFrame passing the bytes of the file stored in a local variable?


I need to read an excel file in ".xlsx" format that is stored in a ftp server as a pandas DataFrame. I need to do this inside a docker container running in heroku and I don't think is a good practice downloading the file into the container and then read it from the local path. Instead I would like to save the bytes in a local variable and pass it to the pandas.read_excel() function to get my DataFrame. According to the documentation of pandas it is possible but I have tried all the forms of calling the function that I could understand and still error messages are raised. I have found answers to the .csv case containing text passed as bytes, but not to the excel file passed as bytes.

Here are some code corresponding to what I think is my best güess of the solution, but still doesn´t work. Maybe you can help me understand if there is a step I'm missing:

from ftplib import FTP
import pandas as pd
from io import BytesIO

# stablish ftp connection
ftp = FTP()
ftp.connect(host='myhost',port=myport)
ftp.login(user='myuser',passwd='mypassword')

# create a class with the __call__ method that can be used to store the bytes
class BinaryReader:
  def __init__(self):
    self.data = BytesIO()
  def __call__(self,s):
     self.data = BytesIO(s)

# Create an instance of that class and read the bytes into it
myBinary = BinaryReader()
folder = 'myFolder/'
file = 'myFile.xlsx'
ftp.retrbinary('RETR '+ folder + file, myBinary)

# Read my file usind pandas
pd.read_excel(myBinary.data, engine='openpyxl')

I have to use an engine because pandas cannot asume any extension of the file that I am passing as io. The code from above raises the following exception exactly in the last line:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1022/801386693.py in <module>
----> 1 pd.ExcelFile(myBinary.data,engine='openpyxl')

~/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
   1417         self.storage_options = storage_options
   1418 
-> 1419         self._reader = self._engines[engine](self._io, storage_options=storage_options)
   1420 
   1421     def __fspath__(self):

~/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/io/excel/_openpyxl.py in __init__(self, filepath_or_buffer, storage_options)
    523         """
    524         import_optional_dependency("openpyxl")
--> 525         super().__init__(filepath_or_buffer, storage_options=storage_options)
    526 
    527     @property

~/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/io/excel/_base.py in __init__(self, filepath_or_buffer, storage_options)
    516             self.handles.handle.seek(0)
    517             try:
--> 518                 self.book = self.load_workbook(self.handles.handle)
    519             except Exception:
    520                 self.close()
...
--> 744             self._file.seek(self._pos)
    745             data = self._file.read(n)
    746             self._pos = self._file.tell()

ValueError: negative seek value -9342652

I have tried passing the bytes and specifying the 'openpyxl' engine that is the one that can read '.xlsx' format, but the "negative seek value" error has been shown. I would like to pass pandas.read_excel function the bytes and that the functions to know the format of the file and return a DataFrame, maybe via some other argument that specifyes the file extension, or the engine. Another excel reader that works for reading the bytes can work for me as well.


Solution

  • Test server setup with pyftpdlib to serve files in current working directory

    from pyftpdlib.authorizers import DummyAuthorizer
    from pyftpdlib.handlers import FTPHandler
    from pyftpdlib.servers import FTPServer
    import os
    
    HOST = "localhost"
    PORT = 21
    
    authorizer = DummyAuthorizer()
    authorizer.add_anonymous(os.getcwd())
    
    handler = FTPHandler
    handler.authorizer = authorizer
    
    server = FTPServer((HOST, PORT), handler)
    server.serve_forever()
    

    Example code to read excel file called test.xlsx

    from ftplib import FTP
    from io import BytesIO
    from pandas import read_excel
    
    HOST = "localhost"
    PORT = 21
    
    ftp = FTP()
    ftp.connect(host=HOST, port=PORT)
    ftp.login()
    
    f = BytesIO()
    ftp.retrbinary("RETR " + "test.xlsx", f.write)
    df = read_excel(f.getvalue())
    print(df.info)
    ftp.quit()