Tags: python-3.x, google-cloud-platform, google-cloud-vertex-ai

Vertex AI scheduled notebook doesn't recognize the existence of folders


I have a managed Jupyter notebook in Vertex AI that I want to schedule. The notebook works just fine as long as I start it manually, but as soon as it is scheduled it fails. In fact, many things go wrong when it runs on a schedule; some of them are fixable. Before explaining my problem, let me first give some context.

The notebook gathers information from an API for several stores and saves the data in different folders before processing it, writing CSV files to store-specific folders and loading the data into BigQuery. So, in the location of the notebook, I have:

  • The notebook
  • Functions needed for the handling of data (as *.py files)
  • A series of folders, some of which contain subfolders that in turn have subfolders of their own


When I execute the notebook manually, there is no problem: everything works and all files end up exactly where they should, as does the data in the various BigQuery tables.

However, when the execution of the notebook is scheduled, everything goes wrong. First, the *.py files cannot be imported. That one was fixable: I copied the functions directly into the notebook.

The next error, however, is where I am at a loss, because I have no idea why it happens or how to fix it. The code that leads to the error is the following:

# Imports used by this cell (defined earlier in the notebook, shown here for completeness)
import datetime
import json
import os
import pathlib
from datetime import timedelta

import numpy as np
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth

internal = "https://api.************************"

df_descriptions = []

# Fetch the store list from the internal API and save the raw JSON
# (userInternal and keyInternal are credentials defined elsewhere in the notebook)
storess = internal
response_stores = requests.get(storess, auth=HTTPBasicAuth(userInternal, keyInternal))
pathlib.Path("stores/request_1.json").write_bytes(response_stores.content)

filepath = "stores"
files = os.listdir(filepath)

# Flatten every JSON file in the "stores" folder into a DataFrame
for file in files:
    with open(filepath + "/" + file) as json_string:
        jsonstr = json.load(json_string)
        information = pd.json_normalize(jsonstr)
    df_descriptions.append(information)

# Keep only stores that have at least one ID mapping
StoreINFO = pd.concat(df_descriptions)
StoreINFO = StoreINFO.dropna()
StoreINFO = StoreINFO[StoreINFO['storeIdMappings'].map(lambda d: len(d)) > 0]

cloud_store_ids = list(set(StoreINFO.cloudStoreId))

LastWeek = datetime.date.today() - timedelta(days=2)
LastWeek = np.datetime64(LastWeek)

and the error reported is:

FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_165/2970402631.py in <module>
      5 storess = internal
      6 response_stores = requests.get(storess,auth = HTTPBasicAuth(userInternal, keyInternal))
----> 7 pathlib.Path("stores/request_1.json").write_bytes(response_stores.content)
      8 
      9 filepath = "stores"

/opt/conda/lib/python3.7/pathlib.py in write_bytes(self, data)
   1228         # type-check for the buffer interface before truncating the file
   1229         view = memoryview(data)
-> 1230         with self.open(mode='wb') as f:
   1231             return f.write(view)
   1232 

/opt/conda/lib/python3.7/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
   1206             self._raise_closed()
   1207         return io.open(self, mode, buffering, encoding, errors, newline,
-> 1208                        opener=self._opener)
   1209 
   1210     def read_bytes(self):

/opt/conda/lib/python3.7/pathlib.py in _opener(self, name, flags, mode)
   1061     def _opener(self, name, flags, mode=0o666):
   1062         # A stub for the opener argument to built-in open()
-> 1063         return self._accessor.open(self, flags, mode)
   1064 
   1065     def _raw_open(self, flags, mode=0o777):

FileNotFoundError: [Errno 2] No such file or directory: 'stores/request_1.json'

I would gladly do this another way, for instance by using GCS buckets, but my issue is the sub-folders. There are many stores, and I do not wish to create the folders manually, because some of the retailers I am doing this for have over 1000 stores. My Python code generates all these folders, and as I understand it, that is not feasible in GCS. To give an idea of the scale, the folder generation is roughly of the shape sketched below (simplified; the real code also writes the CSV files into each folder, and the store IDs are assumed to be strings here).
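import os

# one folder per store, named after its cloud store ID
for store_id in cloud_store_ids:
    os.makedirs(os.path.join("stores", store_id), exist_ok=True)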

How can I solve this issue?


Solution

  • GCS uses a flat namespace, so folders don't actually exist, but they can be simulated as described in this documentation. An object name such as "stores/store_0001/data.csv" behaves, for all practical purposes, like a file inside nested folders, so your per-store hierarchy can live in a bucket without any manual folder creation.
  • As for the error itself: the scheduled run executes the notebook in a fresh environment, so the "stores" directory that sits next to the notebook locally does not exist there. You can either use an absolute path (one starting with "/", not a relative one) or create the "stores" directory at runtime (with "mkdir") before writing to it, as sketched below. For more information you can check this blog.
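A minimal sketch of the runtime fix, assuming the rest of the cell stays as posted: create the folder tree before the first write. pathlib.Path.mkdir with parents=True and exist_ok=True is idempotent, so it is safe to call on every scheduled run.

import pathlib

stores_dir = pathlib.Path("stores")
stores_dir.mkdir(parents=True, exist_ok=True)  # create "stores" if the execution environment lacks it
(stores_dir / "request_1.json").write_bytes(response_stores.content)

And if the data moves to GCS instead, the sub-folder worry disappears entirely, because the "folders" are just prefixes inside object names. A sketch using the google-cloud-storage client; the bucket name and object path are placeholders:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-output-bucket")  # placeholder bucket name
# No directory creation needed: the "stores/store_0001/" prefix is part of the object name
blob = bucket.blob("stores/store_0001/request_1.json")
blob.upload_from_string(response_stores.content)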