Search code examples
pythonamazon-web-servicesaws-lambdaamazon-rds

Convert JSON with dictionaries into pandas Dataframe (AWS)


I have an Amazon serverless Aurora SQL database instance with some debt installments data. I was trying to connect on the DB with AWS Lambda (python 3.7) and found this method:

import boto3
rds_client = boto3.client('rds-data')

database_name = 'dbname'
db_cluster_arn = 'arn:aws:rds:us-east-1:xxxx:cluster:xxxx'
db_credentials_secrets_store_arn = 'arn:aws:secretsmanager:us-east-1:xxxx:secret:rds-db-credentials/cluster-xxxx/'

def lambda_handler(event, context):
    response = execute_statement('SELECT * FROM focafidc.estoque');
    json_string = str(response)
    return response

def execute_statement(sql):
    response = rds_client.execute_statement(
        secretArn=db_credentials_secrets_store_arn,
        database=database_name,
        resourceArn=db_cluster_arn,
        sql=sql
    )
    return response;

The response returns something like a JSON with data nested in dictionaries:

{
  "ResponseMetadata": {
    "RequestId": "f7df6de2-8144-4b7b-9cf0-c828454b4a0d",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "f7df6de2-8144-4b7b-9cf0-c828454b4a0d",
  "content-type": "application/json",
  "content-length": "324685",
  "date": "Tue, 10 May 2022 13:51:57 GMT"
},
"RetryAttempts": 0
  },
  "numberOfRecordsUpdated": 0,
  "records": [
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 1
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2023-05-02"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 2
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2024-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 3
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2025-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 4
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2026-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 5
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2027-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
    ]
  ]
}

I need this data to be a pandas dataframe, so I tried to json_normalize the response JSON and got the following result:

bd1 = pd.json_normalize(response,['records'])
print(bd1)

0   ...                                 20
0    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
1    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
2    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
3    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
4    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}

Can you guys suggest any method to create or convert this to an only values Dataframe?


Solution

  • How about we first parsing the records to standard python objects, and then we handle the JSON-like python structure to dataframe. Assume that you've already parsed the records to nested-list of dicts like the following:

    true, false, null = True, False, None
    
    records = [
        [
            {
                "stringValue": "2022-05-02"
            },
            {
                "longValue": 1
            },
            {
                "longValue": 1
            },
            {
                "stringValue": "a3789"
            },
            {
                "stringValue": "519.60"
            },
            {
                "stringValue": "2023-05-02"
            },
            {
                "stringValue": "2598.00"
            },
            {
                "longValue": 666000002
            },
            {
                "stringValue": "1.88"
            },
            {
                "stringValue": "b190"
            },
            {
                "stringValue": "1996-03-25"
            },
            {
                "stringValue": "Brasileiro"
            },
            {
                "stringValue": "masculino"
            },
            {
                "stringValue": "false"
            },
            {
                "stringValue": "SP"
            },
            {
                "stringValue": "São Paulo"
            },
            {
                "longValue": 111111111
            },
            {
                "longValue": 1111111111
            },
            {
                "booleanValue": true
            },
            {
                "stringValue": "LOJAS S.A."
            },
            {
                "stringValue": "99999999999999"
            }
        ],
        [
            {
                "stringValue": "2022-05-03"
            },
            {
                "longValue": 1
            },
            {
                "longValue": 2
            },
            {
                "stringValue": "a3789"
            },
            {
                "stringValue": "519.60"
            },
            {
                "stringValue": "2024-05-01"
            },
            {
                "stringValue": "2598.00"
            },
            {
                "longValue": 666000002
            },
            {
                "stringValue": "1.88"
            },
            {
                "stringValue": "b190"
            },
            {
                "stringValue": "1996-03-25"
            },
            {
                "stringValue": "Brasileiro"
            },
            {
                "stringValue": "masculino"
            },
            {
                "stringValue": "false"
            },
            {
                "stringValue": "SP"
            },
            {
                "stringValue": "São Paulo"
            },
            {
                "longValue": 111111111
            },
            {
                "longValue": 1111111111
            },
            {
                "booleanValue": true
            },
            {
                "stringValue": "LOJAS S.A."
            },
            {
                "stringValue": "99999999999999"
            }
        ],
    ]
    

    Here we start to extract the values you focus on:

    def first(seq):
        return next(iter(seq))
    
    
    import pandas as pd
    
    records_values = [[first(item.values()) for item in record] for record in records]
    df = pd.DataFrame(records_values)
    print(df)
    

    The Output is:

               0   1   2      3   ...          17    18          19              20
    0  2022-05-02   1   1  a3789  ...  1111111111  True  LOJAS S.A.  99999999999999
    1  2022-05-03   1   2  a3789  ...  1111111111  True  LOJAS S.A.  99999999999999
    
    [2 rows x 21 columns]
    

    And if you want to keep the correspond value types, you can extract the value types from one item of the records like this, and do the related type-casting later in pandas:

    candidate = records[0]
    value_types = [first(item.keys()) for item in candidate]
    # ['stringValue', 'longValue', 'longValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue',
    # 'longValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue',
    # 'stringValue', 'stringValue', 'longValue', 'longValue', 'booleanValue', 'stringValue', 'stringValue']