python pandas luigi

Atomically read from Excel (for a Luigi workflow)


I'm trying to open an Excel file in my Luigi workflow with pandas.read_excel(), using Luigi's built-in (atomic) target methods.

If self.input() is the Luigi target for my Excel document, I want to do something like:

with self.input().open('r') as f:
    pandas.read_excel(f)

or more generally:

with open(filename) as f:
    pandas.read_excel(f)

However, this gives me an error: *** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte

Disclaimer:

The Excel file comes from an external task, so I have no control over what kind of machine it was created on or whether it contains NAs or blank cells.


Solution

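  • The issue was that the target behind self.input() (which points to where my Excel file is saved) needed format=luigi.format.Nop. Without it, the target is opened with a text format, so the file's bytes are decoded as text (UTF-8 in my case), which fails on a binary Excel file. Nop passes the raw bytes through unchanged. The task that provides the Excel file should return something like this from its output():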

    luigi.LocalTarget('excelfile.xlsx', format=luigi.format.Nop)
    

    With this target definition, I can atomically read using:

    with self.input().open() as f:
        df = pd.read_excel(f)
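
For the plain open(filename) case from the question, the same reasoning appears to apply: opening in text mode makes Python decode the bytes, while binary mode ('rb') hands them over untouched. The snippet below is a minimal sketch of that difference using only the standard library. The payload is a hypothetical stand-in made of arbitrary bytes, not a valid workbook, and the names payload, path, and text_mode_failed are illustrative only.

```python
import os
import tempfile

# Hypothetical stand-in for a binary spreadsheet: arbitrary bytes, NOT a valid
# Excel file. 0xD0 followed by 0xCF is not a valid UTF-8 sequence.
payload = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Write the bytes to a temporary file so we can reopen it in both modes.
with tempfile.NamedTemporaryFile(suffix=".xls", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode: the bytes are decoded while reading, which fails here.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: the bytes come back exactly as they were written.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(text_mode_failed)
print(raw == payload)
```

With a real workbook, the handle opened with 'rb' can be passed directly to pandas.read_excel(f), since it accepts binary file-like objects.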