
ibm_boto3 compatibility issue with scikit-learn on Mac OS


I have a Python 3.6 application that uses scikit-learn, deployed to IBM Cloud (Cloud Foundry). It works fine. My local development environment is Mac OS High Sierra.

Recently, I added IBM Cloud Object Storage functionality (ibm_boto3) to the app. The COS functionality itself works fine: I can upload, download, list, and delete objects using the ibm_boto3 library.

Strangely, the part of the app that uses scikit-learn now freezes up.

If I comment out the ibm_boto3 import statements (and corresponding code), the scikit-learn code works fine.
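
For reference, the failing pattern boils down to something like the following minimal sketch (identifiers and data are illustrative, not our actual app code):

    import ibm_boto3  # commenting out this import makes fit() below work
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 10)  # stand-in for our real feature matrix
    model = KMeans(n_clusters=4, n_jobs=-1)  # parallel K-means
    model.fit(X)  # freezes here on the Mac; completes fine on IBM Cloud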

More perplexingly, the issue only happens on the local development machine running OS X. When the app is deployed to IBM Cloud, it works fine -- both scikit-learn and ibm_boto3 work well side-by-side.

Our only hypothesis at this point is that the ibm_boto3 library somehow surfaces a known issue in scikit-learn (see this -- the parallel version of the K-means algorithm is broken when numpy uses Accelerate on OS X). Note that we only hit this issue once we add ibm_boto3 to the project.
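
If that hypothesis is right, keeping K-means in a single process should sidestep the fork entirely. Here's the quick check we can run locally (again just a sketch, with illustrative data):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 10)

    # n_jobs=1 keeps the clustering in the parent process, so numpy's
    # Accelerate backend is never exercised across a fork. If this completes
    # while n_jobs=-1 hangs, the known issue above is the likely culprit.
    model = KMeans(n_clusters=4, n_jobs=1)
    model.fit(X)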

However, we need to be able to test on localhost before deploying to IBM Cloud. Are there any known compatibility issues between ibm_boto3 and scikit-learn on Mac OS?

Any suggestions on how we can avoid this on the dev machine?
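
One workaround we've seen suggested for the Accelerate freeze is to have joblib start its worker processes with forkserver instead of fork; we haven't verified that it helps with the ibm_boto3 interaction, but for completeness:

    import os

    # Must be set before scikit-learn / joblib create any worker processes,
    # so do it before the sklearn imports.
    os.environ['JOBLIB_START_METHOD'] = 'forkserver'

    from sklearn.cluster import KMeans  # safe to import and use as usual now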

Cheers.


Solution

  • Up until now, there haven't been any known compatibility issues. :)

    At some point there were some issues with the vanilla SSL libraries that ship with OS X, but if you're able to read and write data, that isn't the problem here.

    Are you using HMAC credentials? If so, I'm curious whether the behavior persists when you use the original boto3 library instead of the IBM fork.

    Here's a simple example that shows how you might use pandas with the original boto3:

    import io  # standard library package used to wrap raw bytes in a file-like stream
    import boto3  # package used to connect to IBM COS using the S3 API
    import pandas as pd  # lightweight data analysis package
    from botocore.client import Config  # used to pin the auth signature version
    
    access_key = '<access key>'
    secret_key = '<secret key>'
    pub_endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
    pvt_endpoint = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'
    bucket = 'demo'  # the bucket holding the objects being worked on.
    object_key = 'demo-data'  # the name of the data object being analyzed.
    result_key = 'demo-data-results'  # the name of the output data object.
    
    
    # First, create a client that can connect to IBM COS. The client needs to
    # know which endpoint to connect to (public or private), which credentials
    # to use, and which signature version to use for authentication.
    cos = boto3.client('s3', endpoint_url=pub_endpoint,
                       aws_access_key_id=access_key,
                       aws_secret_access_key=secret_key,
                       region_name='us',
                       config=Config(signature_version='s3v4'))
    
    # Since we've already uploaded the dataset into cloud storage, we just need
    # to identify which object we want. get_object returns a dictionary of the
    # response headers plus a streaming handle to the object's contents.
    obj = cos.get_object(Bucket=bucket, Key=object_key)
    
    # Because this is all REST-based, the actual contents of the file travel in
    # the response body, so we read the stream holding the CSV data into memory.
    data = obj['Body'].read()
    
    # Wrap those bytes in an in-memory stream and read them into a pandas dataframe.
    df = pd.read_csv(io.BytesIO(data))
    
    # This is just a trivial example: take that dataframe and build a JSON
    # document containing the mean of each numeric column.
    output = df.mean(axis=0, numeric_only=True).to_json()
    
    # Finally, write that JSON string back to COS as a new object in the same bucket.
    cos.put_object(Bucket=bucket, Key=result_key, Body=output)
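
    And if you want to stay on the IBM fork but are using IAM (API key) credentials rather than HMAC, the client construction looks roughly like this -- a sketch following the IBM COS SDK docs, so substitute your own API key, service instance ID, and endpoints:

    import ibm_boto3
    from ibm_botocore.client import Config  # the fork's equivalent of botocore's Config

    cos = ibm_boto3.client(
        's3',
        ibm_api_key_id='<api key>',
        ibm_service_instance_id='<service instance id>',
        ibm_auth_endpoint='https://iam.ng.bluemix.net/oidc/token',
        config=Config(signature_version='oauth'),
        endpoint_url=pub_endpoint)

    # The rest of the example is unchanged: get_object and put_object behave
    # the same way against either client.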