Tags: python, azure, python-import, azure-synapse

Is there a sane way of writing an external Python library with mssparkutils calls?


I have been tasked with clarifying and simplifying a pretty knotty codebase, which exists at present as a series of "runbooks" in Azure Synapse. As part of that process, I thought it would be good to put some of the more convoluted data analysis into an external Python library. That way, we can do any software development locally, which tends to be a lot quicker; we can also carry out unit testing locally, which makes it quicker to track down bugs and fix them.

I ran into a problem early on, however, with anything related to the mssparkutils object. You can call this object freely within the runbooks themselves, but, if you attempt to call that object within a library which is imported into a runbook, your code will crash.

I discovered the dummy-notebookutils package earlier today, and I thought it might be my salvation, but now it seems that I'm out of luck. I use it like this in the library:

from notebookutils import mssparkutils
from pathlib import Path

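def copy_file_to_folder(path_to_file: str, path_to_folder: str) -> str:
    """Copy a file into a local folder and return the path of the copy."""
    # The "file:" scheme points the copy at the local file system.
    mssparkutils.fs.cp(path_to_file, f"file:{path_to_folder}")
    # Return the path to which the file was copied.
    return str(Path(path_to_folder) / Path(path_to_file).name)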

This certainly does what I want locally, i.e. mssparkutils does nothing. But I run into seemingly insurmountable problems when I import this library into a runbook. Specifically, if I try to run this line in the runbook:

props = json.loads(mssparkutils.credentials.getPropertiesAll(linked_service))

It gives me this exception:

NameError: name 'mssparkutils' is not defined

If I try to fix the problem by placing an import above it:

from notebookutils import mssparkutils

Then mssparkutils.credentials.getPropertiesAll(linked_service) returns an empty string (which it was not doing before), and therefore json.loads() crashes.
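To make that failure mode easier to spot, the parsing step can be guarded so that an empty result produces a clear message instead of a JSON decoding error. This is only a minimal sketch: parse_properties is a hypothetical helper name, and it assumes the raw value is the string returned by getPropertiesAll.

```python
import json


def parse_properties(raw: str) -> dict:
    """Parse the JSON string returned for a linked service.

    Hypothetical helper: raises a descriptive error when the string is
    empty, which is what the dummy package was returning in the runbook.
    """
    if not raw or not raw.strip():
        raise ValueError(
            "Linked service properties are empty; the dummy notebookutils "
            "package may be installed in place of the built-in one."
        )
    return json.loads(raw)
```

In the runbook this would be called as parse_properties(mssparkutils.credentials.getPropertiesAll(linked_service)).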

It seems that just the presence of that dummy-notebookutils import is interfering with the built-in mssparkutils, and is thus causing havoc.

Is there a sane way of writing external libraries for use in Synapse like this? Or is my whole approach wrong? Is there a way of using dummy-notebookutils so that the runbook doesn't crash?

I have been thinking about just wrapping the whole library in one big class and then, within the notebook, passing the mssparkutils object into the constructor. That would probably work, but it would disrupt the existing structure of the codebase, which I am reluctant to do if there is any viable alternative.
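For reference, the constructor-injection idea could look roughly like the sketch below. The names FileTools and FakeSparkUtils are hypothetical, and the fake only records calls to fs.cp; it is an assumption about what a local stand-in needs, not a copy of the real mssparkutils interface.

```python
from pathlib import Path
from typing import Any


class FileTools:
    """Hypothetical wrapper that receives mssparkutils from the notebook."""

    def __init__(self, spark_utils: Any) -> None:
        # The library never imports notebookutils itself; it only uses
        # whatever object the caller hands in.
        self._utils = spark_utils

    def copy_file_to_folder(self, path_to_file: str, path_to_folder: str) -> str:
        self._utils.fs.cp(path_to_file, f"file:{path_to_folder}")
        # Return the path to which the file was copied.
        return str(Path(path_to_folder) / Path(path_to_file).name)


class _FakeFs:
    """Local stand-in for mssparkutils.fs that records cp() calls."""

    def __init__(self) -> None:
        self.calls: list = []

    def cp(self, source: str, destination: str) -> None:
        self.calls.append((source, destination))


class FakeSparkUtils:
    """Local stand-in for mssparkutils, for unit tests only."""

    def __init__(self) -> None:
        self.fs = _FakeFs()
```

In the notebook the call would be FileTools(mssparkutils), while local unit tests would pass FakeSparkUtils() instead, so no dummy package is needed on either side.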


Solution

  • I believe I've found the source of the problem. I had added dummy-notebookutils as a requirement in the setup.py of my external library. Therefore, when I imported that library into Synapse, dummy-notebookutils was installed on the Spark pool as well, where it interfered with the built-in mssparkutils object.

    Now that I've removed dummy-notebookutils from setup.py and tried again, the issue seems to have disappeared.
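One way to keep the dummy package available for local development without shipping it to the Spark pool is to list it as an optional extra rather than a hard requirement. This goes slightly beyond what I actually did (I simply removed it), so treat it as a sketch: the package name my_synapse_lib, the version, and the "dev" extra are all hypothetical.

```python
def build_setup_kwargs() -> dict:
    """Return the keyword arguments for setuptools.setup().

    Hypothetical values throughout; only the placement of
    dummy-notebookutils matters here.
    """
    return {
        "name": "my_synapse_lib",  # hypothetical package name
        "version": "0.1.0",        # hypothetical version
        "packages": ["my_synapse_lib"],
        # Nothing here is installed on the Spark pool beyond what the
        # library needs at runtime, so the built-in mssparkutils is untouched.
        "install_requires": [],
        # Installed only when the extra is requested explicitly.
        "extras_require": {"dev": ["dummy-notebookutils"]},
    }


# In a real setup.py the file would end with:
#     from setuptools import setup
#     setup(**build_setup_kwargs())
```

Locally, the extra can then be installed with pip install -e ".[dev]", while a plain install of the library pulls in no dummy package.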