python-3.x aws-lambda amazon-emr aws-glue aws-glue-data-catalog

How to configure "Use AWS Glue Data Catalog for table metadata" for EMR cluster option through boto library?


I am trying to create an EMR cluster from an AWS Lambda function written in Python with the boto library. I can create the cluster, but I want it to use the AWS Glue Data Catalog for table metadata so that Spark can read directly from the Glue Data Catalog. When I create the cluster through the AWS console, I check the "Use AWS Glue Data Catalog for table metadata" checkbox, which does what I need. I cannot find how to do the same through the boto library.

Below is the Python code I am using to create the EMR cluster.

    import logging

    import boto3

    logger = logging.getLogger(__name__)

    try:
        connection = boto3.client(
            'emr',
            region_name='xxx'
        )
        cluster_id = connection.run_job_flow(
            Name='EMR-LogProcessing',
            LogUri='s3://somepath/',
            ReleaseLabel='emr-5.21.0',
            Applications=[
                {
                    'Name': 'Spark'
                },
            ],
            Instances={
                'InstanceGroups': [
                    {
                        'Name': "MasterNode",
                        'Market': 'SPOT',
                        'InstanceRole': 'MASTER',
                        'BidPrice': 'xxx',
                        'InstanceType': 'm3.xlarge',
                        'InstanceCount': 1,
                    },
                    {
                        'Name': "SlaveNode",
                        'Market': 'SPOT',
                        'InstanceRole': 'CORE',
                        'BidPrice': 'xxx',
                        'InstanceType': 'm3.xlarge',
                        'InstanceCount': 2,
                    }
                ],
                'Ec2KeyName': 'xxx',
                'KeepJobFlowAliveWhenNoSteps': True,
                'TerminationProtected': False
            },
            VisibleToAllUsers=True,
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            Tags=[
                {
                    'Key': 'Name',
                    'Value': 'EMR-LogProcessing',
                },
                {
                    'Key': 'env',
                    'Value': 'dev',
                },
            ],
        )

        print('cluster created with the step...', cluster_id['JobFlowId'])
    except Exception as exp:
        logger.info("Exception occurred in createEMRcluster!!! %s", str(exp))

I cannot figure out how to achieve this. Please help.


Solution

  • Set hive.metastore.client.factory.class through the hive-site configuration classification:

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ]
    

    Pass the snippet above to boto3's run_job_flow function through its Configurations parameter.

    Reference: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
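
As a rough illustration of wiring this into the question's code, here is a minimal sketch. The helper names `glue_catalog_configurations` and `create_cluster` are hypothetical and not part of boto3; only `run_job_flow` and its `Configurations` parameter come from the library. The sketch also adds a `spark-hive-site` entry, since the AWS EMR documentation on using the Glue Data Catalog with Spark SQL lists that classification (with the same property) and the question is about Spark. Treat that as something to check against the release you run.

```python
# Fully qualified factory class taken from the answer above.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(classifications=("hive-site", "spark-hive-site")):
    """Build a Configurations list pointing each classification at Glue.

    Hypothetical helper: 'hive-site' is from the answer; 'spark-hive-site'
    is an assumption based on the EMR docs for Spark SQL.
    """
    return [
        {
            "Classification": name,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        for name in classifications
    ]


def create_cluster(emr_client, **job_flow_kwargs):
    """Call run_job_flow with the Glue settings added and return the cluster id.

    emr_client is expected to be the object returned by boto3.client('emr', ...);
    job_flow_kwargs are the same keyword arguments used in the question
    (Name, LogUri, ReleaseLabel, Applications, Instances, and so on).
    """
    response = emr_client.run_job_flow(
        Configurations=glue_catalog_configurations(),
        **job_flow_kwargs,
    )
    return response["JobFlowId"]
```

With the question's code, this would be called as `create_cluster(connection, Name='EMR-LogProcessing', ...)`, passing the remaining arguments unchanged. Equivalently, the list can be written inline as a `Configurations=[...]` argument next to `Applications` in the existing `run_job_flow` call.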