python · json · azure · pyspark · azure-synapse

Writing JSON Data to a File in an Azure Synapse PySpark Notebook


I'm working on a project to extract data from an API using PySpark and save it to a file in Azure Data Lake Gen 2. Yes, I know this should be simple, and I'm having no trouble extracting the data, but I'm having a heck of a time figuring out what to do with it after I get it.

My process is as follows:

import requests
import json

workspace_query = {'format': 'json'}
response = requests.get("https://api.mavenlink.com/api/v1/workspaces", headers={'Authorization': <some things InfoSec would appreciate that I didn't put in a public forum>}, params=workspace_query)

# Parse the response and pretty-print it into a JSON string.
json_response = response.json()
kantata_json_data = json.dumps(json_response, indent=4, sort_keys=True)

Okay, so from here I can, if I so choose, print the contents of kantata_json_data (the details of which InfoSec would also appreciate I didn't post in a public forum) and it all looks like JSON. Yay.

What I can't figure out from here is how to actually write the contents of that variable to a file successfully. Specifically, if I try to write, I get something along these lines:

with open("https://p4syndev.dfs.core.windows.net/<Azure subscription stuff>/<my foldername>/", "w") as file:
    file.write(kantata_json_data)

And then the error:

FileNotFoundError                         Traceback (most recent call last)
Cell In [23], line 1
----> 1 with open("https://p4syndev.dfs.core.windows.net/<Azure subscription stuff>/<my foldername>/", "w") as file:
      2     file.write(kantata_json_data)

File ~/cluster-env/env/lib/python3.10/site-packages/IPython/core/interactiveshell.py:282, in _modified_open(file, *args, **kwargs)
    275 if file in {0, 1, 2}:
    276     raise ValueError(
    277         f"IPython won't let you open fd={file} by default "
    278         "as it is likely to crash IPython. If you know what you are doing, "
    279         "you can use builtins' open."
    280     )
--> 282 return io_open(file, *args, **kwargs)

This is confusing because I had assumed I'd be able to write to a folder with either the URL or the ABFSS path, but I get the same error for both. I'm new to Azure, and there absolutely could be some secret sauce I'm missing here.

So I tried to write to a dataframe and then convert that to JSON (which seems frankly unpythonic, but whatever) and received an IllegalArgumentException.

Any thoughts about how I should handle this are appreciated.


Solution

  • This is my JSON data.

    [screenshot: sample JSON data]

    You can use the below code to write it.

    import json
    import pandas as pd

    # Load the JSON string into a dataframe and write it to ADLS Gen2 via the linked service.
    df = pd.DataFrame(json.loads(json_data))
    df.to_json('abfs://dats/output/json_files/user.json', storage_options={'linked_service': 'AzureDataLakeStorage1'})
    


    Output:

    [screenshot: user.json written to the storage account]

    Here I am creating a pandas dataframe and writing it to the storage account using the abfs protocol.
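
    For the asker's case, a minimal adaptation might look like the sketch below. The container name and output path are illustrative assumptions, and kantata_json_data is the variable from the question:

    import json
    import pandas as pd

    # Hypothetical adaptation of the above to the question's variables.
    # 'my-container' and the output path are placeholders, not real values.
    df = pd.DataFrame(json.loads(kantata_json_data))
    df.to_json(
        'abfs://my-container/output/workspaces.json',
        storage_options={'linked_service': 'AzureDataLakeStorage1'}
    )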

    You can use a path in either of the following forms:

    abfs://<container_name>/<path_to_file>

    OR

    abfs://<container_name>@<storage_account>.dfs.core.windows.net/<path_to_file>

    There will be a default linked service in your Synapse workspace. Give that name in storage_options, or, if you created your own linked service, use that one.

    [screenshot: linked services in the Synapse workspace]
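
    If you would rather not go through a linked service, pandas can also authenticate with the storage account key directly. This is a sketch under that assumption; the container, account, and key below are placeholders, and it relies on the adlfs/fsspec packages that pandas uses for abfs paths:

    # Assumption: authenticate with the storage account key instead of a
    # linked service. All bracketed values are placeholders.
    df.to_json(
        'abfs://<container_name>@<storage_account>.dfs.core.windows.net/output/user.json',
        storage_options={'account_key': '<storage-account-key>'}
    )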

    You cannot use these kinds of protocols (https:// or abfss:// paths) to access files with the open function, because open only searches the local filesystem, that is, the pool you attached.
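
    As a side note, if you want to write the raw JSON string without building a dataframe at all, Synapse's mssparkutils file utilities can write to the attached storage. A sketch, assuming the question's kantata_json_data string and placeholder container/account names:

    from notebookutils import mssparkutils

    # Assumption: write the raw JSON string straight to ADLS Gen2.
    # The container, account, and path are placeholders.
    mssparkutils.fs.put(
        'abfss://<container_name>@<storage_account>.dfs.core.windows.net/output/workspaces.json',
        kantata_json_data,
        True  # overwrite an existing file
    )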