python, amazon-web-services, pyspark, aws-glue

Reading Dynamic Data Types from S3 with AWS Glue


I have JSON stored in S3. Sometimes units is stored as a string, and sometimes as an integer. Unfortunately, this was a bug, and I now have billions of records with mismatched data types in the source JSON.

Example:

{
  "other_stuff": "stuff",
  "units": 2
}
{
  "other_stuff": "stuff",
  "units": "2"
}

I want to determine dynamically whether the value is a string or an integer, and then load it into AWS Redshift as an integer.

If my mapping is ("units", "string", "units", "int"), only the string values are converted correctly. If I use ("units", "int", "units", "int"), the opposite happens: only the integer values work.

How do I dynamically cast the source field so that it is always loaded into Redshift as an integer? You can assume that all values are numeric and not null, and that the attribute is always present.


Solution

  • You can use the resolveChoice method of the Glue DynamicFrame to cast the mixed-type column to a single type.

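    # df is assumed to be the Glue DynamicFrame read from S3
    resolved_choices = df.resolveChoice(
        specs=[
            # cast every value of the ambiguous "units" column to int
            ('units', 'cast:int')
        ]
    )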
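
To make the intended result concrete, here is a minimal plain-Python sketch of the per-record effect the cast is meant to have. It is not Glue code and does not reproduce how Glue implements resolveChoice. The sample lines and the helper name cast_units_to_int are hypothetical, and it relies on the question's assumption that every value is numeric and non-null.

```python
import json

# Hypothetical sample records mirroring the question:
# "units" is sometimes an integer and sometimes a numeric string.
raw_lines = [
    '{"other_stuff": "stuff", "units": 2}',
    '{"other_stuff": "stuff", "units": "2"}',
]


def cast_units_to_int(record):
    """Return a copy of the record with "units" coerced to int.

    Illustrative helper only (not part of any Glue API). It assumes
    the value is always present and numeric, as stated in the question.
    """
    fixed = dict(record)
    fixed["units"] = int(fixed["units"])  # works for both 2 and "2"
    return fixed


records = [cast_units_to_int(json.loads(line)) for line in raw_lines]
print(records)
```

After this step both records carry units as an integer, which is the shape the Redshift column expects. In the Glue job itself the resolveChoice call above does this work, so no per-record Python is needed there.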