I'm working on a project to extract data from an API using PySpark and save to a file in Azure Data Lake Gen 2. Yes, I know this should be simple and I'm having no trouble extracting the data, but I'm having a heck of a time figuring out what to do with it after I get it.
My process is as follows:
workspace_query = {'format': 'json'}
response = requests.get("https://api.mavenlink.com/api/v1/workspaces", headers={'Authorization': <some things InfoSec would appreciate that I didn't put in a public forum>}, params=workspace_query)
json_response = response.json()
kantata_json_data = json.dumps(json_response, indent=4, sort_keys=True)
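As a side note, the json.dumps step just produces a plain Python string. A minimal stand-in (with a made-up payload, since the real one comes from the Mavenlink API) shows the shape of what ends up in kantata_json_data:

```python
import json

# Stand-in for response.json() -- the real payload comes from the API
json_response = {"workspaces": {"101": {"title": "Example Project"}}}

# Same serialization step as above: the result is an ordinary str,
# which is what any file-writing API will expect.
kantata_json_data = json.dumps(json_response, indent=4, sort_keys=True)
print(type(kantata_json_data).__name__)  # str
```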
Okay, so from here I can, if I so choose, print the contents of kantata_json_data (the details of which InfoSec would also appreciate I didn't post in a public forum) and it all looks like JSON. Yay.
What I can't figure out from here is how to actually write the contents of that variable to a file successfully. Specifically, if I try to write, I get something along these lines:
with open("https://p4syndev.dfs.core.windows.net/<Azure subscription stuff>/<my foldername>/", "w") as file:
    file.write(kantata_json_data)
And then the error:
FileNotFoundError Traceback (most recent call last)
Cell In [23], line 1
----> 1 with open("https://p4syndev.dfs.core.windows.net/<Auzre subscription stuff>/<my foldername>/", "w") as file:
2 outfile.write(kantata_json_data)
File ~/cluster-env/env/lib/python3.10/site-packages/IPython/core/interactiveshell.py:282, in _modified_open(file, *args, **kwargs)
275 if file in {0, 1, 2}:
276 raise ValueError(
277 f"IPython won't let you open fd={file} by default "
278 "as it is likely to crash IPython. If you know what you are doing, "
279 "you can use builtins' open."
280 )
--> 282 return io_open(file, *args, **kwargs)
This is confusing because I had assumed that I'd be able to write to a folder with either the URL or the ABFSS path, but I get the same error for both. I'm new to Azure, and there absolutely could be some secret sauce I'm missing here.
So I tried writing to a dataframe and then converting that to JSON (which seems frankly unpythonic, but whatever) and received an IllegalArgumentException.
Any thoughts about how I should handle this are appreciated.
This is my json data.
You can use the below code to write it:
import json
import pandas as pd

# json_data is the JSON string from the API response
df = pd.DataFrame(json.loads(json_data))
df.to_json('abfs://dats/output/json_files/user.json', storage_options={'linked_service': 'AzureDataLakeStorage1'})
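To sanity-check the pandas mechanics locally (the sample records below are made up), the same two lines work with an ordinary path; on Synapse you would swap in the abfs:// URI and storage_options:

```python
import json
import os
import tempfile
import pandas as pd

# Made-up sample payload shaped as a list of records
json_data = json.dumps([{"id": 1, "title": "Project A"},
                        {"id": 2, "title": "Project B"}])

df = pd.DataFrame(json.loads(json_data))

# Locally, to_json writes to disk; with an abfs:// path plus
# storage_options it writes to the storage account instead.
out_path = os.path.join(tempfile.mkdtemp(), "user.json")
df.to_json(out_path, orient="records")
```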
Output:
Here I am creating a pandas dataframe and writing it to the storage account using the abfs protocol.
You can use a path like any one of the following:
abfs://<container_name>/<path_to_file>
OR
abfs://<container_name>@<storage_account>.dfs.core.windows.net/<path_to_file>
And there will be a default linked service in your Synapse workspace. Give that name in storage_options, or if you created your own linked service, use that one.
You cannot use these kinds of protocols to access files with the open function, because open searches for files in the local filesystem, that is, the pool you attached.
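You can reproduce that locally (the URL below is made up): open treats the whole string as a local path, so the scheme prefix just becomes a directory name that does not exist.

```python
import json

data = json.dumps({"workspaces": []}, indent=4)

# open() resolves its argument against the local filesystem of the pool,
# so "https:" is treated as a (nonexistent) directory name.
try:
    with open("https://example.dfs.core.windows.net/container/out.json", "w") as f:
        f.write(data)
except FileNotFoundError as err:
    print("Failed as expected:", err.strerror)
```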