Tags: python, azure, cortana-intelligence, azure-machine-learning-service

How can I save a dataset from an IPython notebook in Azure ML Studio?


I use the following command to save my output results:

ws.datasets.add_from_dataframe(data, 'GenericCSV', 'output.csv', 'Uotput results')

where ws is an azureml.Workspace object and data is a pandas.DataFrame.

It works fine if my dataset is smaller than 4 MB. Otherwise I get an error:

AzureMLHttpError: Maximum request length exceeded.

As I understand it, this error comes from a limit in the Azure environment, and the maximum dataset size cannot be changed.

I could split my dataset into 4 MB parts and download them from Azure ML Studio one by one, but that is very inconvenient when my output dataset is larger than 400 MB.


Solution

  • I read the source code of the azureml Python package and found that it uploads a dataset with a single plain POST request, whose content length is limited to 4194304 bytes (4 MB).
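A minimal sketch for checking in advance whether a dataset would hit that limit. It assumes the uploaded payload is roughly the UTF-8 encoded CSV text of the DataFrame, and it treats the limit as exclusive to stay on the safe side. The helper names `payload_size` and `fits_in_single_request` are made up for illustration and are not part of the azureml package.

```python
MAX_REQUEST_BYTES = 4194304  # request size limit quoted above


def payload_size(csv_text):
    """Return the size in bytes of CSV text once encoded as UTF-8."""
    return len(csv_text.encode("utf-8"))


def fits_in_single_request(csv_text, limit=MAX_REQUEST_BYTES):
    """Return True if the encoded CSV is strictly smaller than the limit.

    Assumption: the limit applies to the encoded CSV body; the real request
    may carry a few extra bytes, so treat a borderline result with caution.
    """
    return payload_size(csv_text) < limit
```

With a pandas.DataFrame named data, this could be called as `fits_in_single_request(data.to_csv(index=False))`. Whether the package serializes the frame with exactly those options is an assumption.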

    I tried modifying "http.py" in the azureml package so that it posts the request with chunked data, and got the following error:

    Traceback (most recent call last):
      File ".\azuremltest.py", line 10, in <module>
        ws.datasets.add_from_dataframe(frame, 'GenericCSV', 'output2.csv', 'Uotput results')
      File "C:\Python34\lib\site-packages\azureml\__init__.py", line 507, in add_from_dataframe
        return self._upload(raw_data, data_type_id, name, description)
      File "C:\Python34\lib\site-packages\azureml\__init__.py", line 550, in _upload
        raw_data, None)
      File "C:\Python34\lib\site-packages\azureml\http.py", line 135, in upload_dataset
        upload_result = self._send_post_req(api_path, raw_data)
      File "C:\Python34\lib\site-packages\azureml\http.py", line 197, in _send_post_req
        raise AzureMLHttpError(response.text, response.status_code)
    azureml.errors.AzureMLHttpError: Chunked transfer encoding is not permitted. Upload size must be indicated in the Content-Length header.
    Request ID: 7b692d82-845c-4106-b8ec-896a91ecdf2d 2016-03-14 04:32:55Z
    

    So the REST API used by the azureml package does not support chunked transfer encoding. I therefore looked at how Azure ML Studio itself implements uploads, and found the following:

    1. It posts a request with Content-Length: 0 to https://studioapi.azureml.net/api/resourceuploads/workspaces/<workspace_id>/?userStorage=true&dataTypeId=GenericCSV, which returns an id in the response body.

    2. It breaks the .csv file into chunks smaller than 4194304 bytes and posts each one to https://studioapi.azureml.net/api/blobuploads/workspaces/<workspace_id>/?numberOfBlocks=<the number of chunks>&blockId=<index of chunk>&uploadId=<the id you get from previous request>&dataTypeId=GenericCSV

    If you really need this functionality, you can implement it in Python using the REST API described above.
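The two steps could be sketched roughly as follows. This is an untested outline under several assumptions, not the Studio's actual implementation: `post` is a hypothetical callable `post(url, params, data)` that sends one authenticated POST and returns the response body as text, the id is assumed to be the response body with surrounding whitespace and quotes stripped, and `blockId` is assumed to be zero-based. Any further request the Studio may send after the last chunk is not covered, because only the two steps above were observed.

```python
BASE_URL = "https://studioapi.azureml.net/api"
MAX_CHUNK_BYTES = 4194304  # each chunk must be smaller than this


def split_bytes(payload, chunk_size=MAX_CHUNK_BYTES - 1):
    """Split a bytes payload into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def upload_in_chunks(post, workspace_id, payload, data_type_id="GenericCSV",
                     chunk_size=MAX_CHUNK_BYTES - 1,
                     parse_id=lambda body: body.strip().strip('"')):
    """Run the two-step upload; returns the upload id from step 1.

    post     -- hypothetical callable post(url, params, data) -> response text
    parse_id -- assumption about how the id appears in the response body
    """
    # Step 1: empty POST (Content-Length 0) to obtain an upload id.
    start_url = "{0}/resourceuploads/workspaces/{1}/".format(BASE_URL, workspace_id)
    body = post(start_url, {"userStorage": "true", "dataTypeId": data_type_id}, b"")
    upload_id = parse_id(body)

    # Step 2: POST every chunk with its index and the total number of chunks.
    chunks = split_bytes(payload, chunk_size)
    block_url = "{0}/blobuploads/workspaces/{1}/".format(BASE_URL, workspace_id)
    for index, chunk in enumerate(chunks):
        params = {
            "numberOfBlocks": len(chunks),
            "blockId": index,  # assumed zero-based
            "uploadId": upload_id,
            "dataTypeId": data_type_id,
        }
        post(block_url, params, chunk)
    return upload_id
```

In practice `post` could wrap `requests.post(url, params=params, data=data, headers=...)` with whatever authentication headers the azureml package already sends; those headers are not shown here. Passing the HTTP call in as a parameter also makes the chunking logic easy to check with a fake before trying it against the real service.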

    If that seems too complicated, report the issue to the maintainers of the azureml Python package. The package is still under development, so your feedback would be very helpful to them.