
Extract columns from Hive struct data type represented as a string in Python


In my Python program, I need to write a function which takes a Hive data type as input and returns whether or not that data type is valid.

The primitive data types supported in Hive are as follows:

from typing import Set

HIVE_SUPPORTED_DATA_TYPES: Set[str] = {'void', 'boolean', 'tinyint', 'smallint', 'int', 'bigint', 'float', 'double',
                                       'decimal', 'string', 'varchar', 'timestamp', 'date', 'binary'}

The complex data types supported by Hive are:

arrays: array<data_type>
maps: map<primitive_type, data_type>
structs: struct<col_name : data_type [comment col_comment], ...>
union: union<data_type, data_type, ...>

In my Python program, I have a variable which stores a primitive or complex Hive data type. I need to write a function which returns True if the data type stored in the variable is valid, and False otherwise.

Primitive data types are easy to validate; I just need to write:

def validate_datatype(_type: str):
    _type = _type.strip()

    if _type in HIVE_SUPPORTED_DATA_TYPES:
        return True

I also managed to validate array and map data types, to any level of nesting. I observed that any valid array type has the syntax array<data_type>, so if I have a type array<data_type>, I recursively validate data_type. In the case of map, the key is always a primitive data type, so map types can also be validated recursively. I used the following function to recursively validate data types:

def validate_datatype(_type: str) -> bool:
    _type = _type.strip()

    if _type in HIVE_SUPPORTED_DATA_TYPES:
        return True

    # Array type has syntax : array<data_type>, where data_type is any valid data_type
    if _type.startswith('array<'):
        assert _type.endswith(">"), "could not find matching '>' for data type 'array<' "
        return validate_datatype(_type[6:-1])

    # Map type has syntax : map<primitive_type, data_type>, where primitive type can only be one of
    # primitive types present in HIVE_SUPPORTED_DATA_TYPES. data_type can be any(primitive or complex)
    # valid hive data type
    if _type.startswith('map<'):
        assert _type.endswith(">"), "could not find matching '>' for data type 'map<' "
        primitive_type, data_type = _type[4:-1].split(',', 1)
        if primitive_type.strip() in HIVE_SUPPORTED_DATA_TYPES and validate_datatype(data_type):
            return True

Such a recursive function is not possible (at least in my opinion) for validating the struct data type, which has the following syntax: struct<col1_name : data_type, col2_name : data_type, ...> (Assume that I don't have to worry about the [COMMENT col_comment] part right now; I don't receive any inputs with column comments.)

To validate the struct data type, I first need to write another function to extract the columns from the struct data type. Let this abstract function be extract_columns, with the following signature:

@abstractmethod
def extract_columns(dt: str) -> List[Tuple[str, str]]:
    pass

I should get the following results for the following inputs:

dt = "struct<col_1: int, col_2 : struct<nested_col_1 : int, nested_col_2 : str>>"

extract_columns(dt)
>> [('col_1','int'), ('col_2', 'struct<nested_col_1 : int, nested_col_2 : str>')]

If I can manage to write such a function, I can recursively validate the data types. Since a struct can contain nested structs and other complex data types, I cannot simply split at : (it might also split the nested structs, which I don't want) or at , (it might also split at nested map and struct types, which again I don't want).
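
For instance, a naive split on , cuts straight through the nested struct:

dt = "struct<col_1: int, col_2 : struct<nested_col_1 : int, nested_col_2 : str>>"

dt[len("struct<"):-1].split(',')
>> ['col_1: int', ' col_2 : struct<nested_col_1 : int', ' nested_col_2 : str>']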

So, in my opinion, re might be able to extract such a list of columns, but I'm not able to figure out the regex for this. Can anyone help me here? Many thanks in advance.


Solution

  • The task here is to find all occurrences of commas and colons that have the same number of < as > to their left. Although some regex flavors have fancy features like recursive patterns, Python's re module cannot count arbitrarily nested brackets, so I think this problem is beyond its scope. In general, regex is only useful for simple pattern matching, like identifying URLs. It's easy to get carried away and want to apply regex everywhere, but this often results in code plagued by unreadable nightmare strings.

    The easiest solution here is plain ol' iteration. To implement your extract_columns function, you'll need a split function that ignores separators that are nested inside of a <>:

    from typing import List

    def first_index(target: str, string: str) -> int:
        # Finds the index of the first top-level occurrence of `target` in `string`.
        # Note the ordering of the checks: '>' is handled before the match test, so
        # searching for '>' returns the bracket that closes the first top-level '<'
        # (which is_valid_datatype below relies on), while '<' is handled after it,
        # so searching for '<' returns the first top-level '<'.
        depth = 0
        for i in range(len(string)):
            if string[i] == '>':
                depth -= 1
            if string[i] == target and depth == 0:
                return i
            if string[i] == '<':
                depth += 1
        # No matches were found
        return len(string)

    def top_level_split(string: str, separator: str) -> List[str]:
        # Splits `string` on every occurrence of `separator` that is not nested in a "<>"
        split = []
        while string != '':
            index = first_index(target=separator, string=string)
            split.append(string[:index])
            string = string[index + 1 :]
        return split
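
    With these helpers, a possible sketch of the extract_columns function from your question might look like the following (it assumes the input really is a struct<...> and, as you stated, that column comments never appear):

    from typing import List, Tuple

    def extract_columns(dt: str) -> List[Tuple[str, str]]:
        # Strip the leading "struct<" and the trailing ">"
        contents = dt.strip()[len('struct<'):-1]
        columns = []
        # Split the columns on top-level commas only
        for column in top_level_split(string=contents, separator=','):
            # Split each column on its first top-level ':' so nested types stay intact
            index = first_index(target=':', string=column)
            name, data_type = column[:index], column[index + 1:]
            columns.append((name.strip(), data_type.strip()))
        return columns

    dt = "struct<col_1: int, col_2 : struct<nested_col_1 : int, nested_col_2 : str>>"
    print(extract_columns(dt))
    # [('col_1', 'int'), ('col_2', 'struct<nested_col_1 : int, nested_col_2 : str>')]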
    

    The full implementation would look like:

    def is_valid_datatype(string: str) -> bool:
        
        if is_valid_primitive(string):
            return True
        
        # Find the opening angle bracket
        left = first_index(target='<', string=string)
        # Find the closing angle bracket
        right = first_index(target='>', string=string)
        if left > right or right == len(string):
            return False
        # Make sure there isn't anything to the right of `right`
        if string[right + 1 : ].strip() != '':
            return False
        
        # The name of the data type, e.g. "array"
        type_name = string[ : left].strip()
        # The substring between the angle brackets
        contents = string[left + 1 : right]
        
        if type_name == 'array':
            return is_valid_datatype(contents)
        if type_name == 'map':
            # We don't need `top_level_split` here because the first type is always primitive
            split = contents.split(',', 1)
            return len(split) == 2 and is_valid_primitive(split[0]) and is_valid_datatype(split[1])
        if type_name == 'struct':
            # Get each column by splitting on top-level commas
            for column in top_level_split(string=contents, separator=','):
                # We don't need `top_level_split` here because the first type is a column name
                split = column.split(':', 1)
                if not (len(split) == 2 and is_valid_colname(split[0]) and is_valid_datatype(split[1])):
                    return False
            # All columns were valid!
            return True
        if type_name == 'union':
            for entry in top_level_split(string=contents, separator=','):
                if not is_valid_datatype(entry):
                    return False
            # All entries were valid!
            return True
    
        # The type name is not recognized
        return False
    

    I'll leave it up to you to implement is_valid_primitive and is_valid_colname, and to allow comments in structs.
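
    If it helps, a minimal sketch of those two helpers could look like the following (the exact column-name rule is an assumption on my part; adjust it to whatever Hive actually allows):

    import re

    HIVE_SUPPORTED_DATA_TYPES = {'void', 'boolean', 'tinyint', 'smallint', 'int', 'bigint',
                                 'float', 'double', 'decimal', 'string', 'varchar',
                                 'timestamp', 'date', 'binary'}

    def is_valid_primitive(string: str) -> bool:
        # A primitive type is valid if it is one of the supported Hive primitives
        return string.strip() in HIVE_SUPPORTED_DATA_TYPES

    def is_valid_colname(string: str) -> bool:
        # Assumption: a column name is a letter or underscore followed by
        # letters, digits or underscores
        return re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', string.strip()) is not None

    print(is_valid_datatype('map<string, array<struct<a : int, b : string>>>'))  # True
    print(is_valid_datatype('array<integer>'))                                   # False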

    Note that this algorithm is worst case O(N^2), where N is the length of the input string. is_valid_datatype is called O(N) times, each time invoking top_level_split which is also O(N).

    If that's too slow, it is possible to achieve linear time. Every data type can be thought of as a tree whose nodes represent nested data types. You could depth-first search through the input string by keeping a stack of all the data types that are ancestors to the data type you're currently parsing. Every time you see array<, for instance, you'd push array onto the stack. Whenever you see >, you'd pop from the stack. This can be done in one sweep through the string. Much more convoluted than the above code, though.
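
    As a rough sketch of that bookkeeping (this only checks that the angle brackets nest correctly while recording the ancestor type names; a real linear-time validator would also check column names, commas and colons against the type on top of the stack):

    def brackets_nest_correctly(string: str) -> bool:
        # Illustrates only the stack idea: push the enclosing type name on
        # every '<' and pop it on every '>'.
        stack = []
        name = ''
        for char in string:
            if char == '<':
                stack.append(name.strip())   # e.g. 'array', 'map', 'struct'
                name = ''
            elif char == '>':
                if not stack:
                    return False             # unmatched '>'
                stack.pop()
                name = ''
            elif char in ',:':
                name = ''
            else:
                name += char
        return not stack                     # every '<' must have been closed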

    Also, general suggestion: try to return False instead of using assert. That way you don't have to worry about code crashing when invalid strings are entered.
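
    For instance, the assert in the array branch of your code could become:

    # instead of:  assert _type.endswith(">"), "could not find matching '>' for data type 'array<' "
    if not _type.endswith('>'):
        return False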

    Good luck with the project!