python, python-3.x, nested, yaml

Create a list of dot-separated path strings from nested YAML


I have a YAML input with various levels of nested objects. I need a Python function that walks through all of it and produces the desired output: a list of strings in which each nested field is joined to its parents with dots, e.g. Object1.Object2.Object3.Object4. Examples are below.

I am trying to achieve it with a recursive function. My code snippet:

tests = []
test2 = {}
def test(config, parent=None):
    previous_parent = None
    names = []
    for column in config:
    
        if column.get("dtype") in ["array", "struct"]:
            parent = column["name"]
            print(f"parent: {parent}")
            test(column["columns"], parent)
            
        else:
            value = column["name"]
            print(f"value: {value}")
            # names.append(value)

And the output is:

value: PartitionDate
value: TransactionID
value: EventTimestamp
parent: ControlTransaction
value: StoreID
parent: RetailTransaction
value: StoreID
value: WorkstationID
...

Input:

columns:
- name: PartitionDate
- name: TransactionID
- name: EventTimestamp
- name: ControlTransaction
  dtype: struct
  columns:
    - name: StoreID
    - name: WorkstationID
    - name: Transaction
      dtype: struct
      columns:
        - name: TransactionID
    - name: TransactionNumber
- name: ControlType
- name: RetailTransaction
  dtype: struct
  columns:
    - name: StoreID
    - name: WorkstationID

Output:

[
PartitionDate,
TransactionID,
EventTimestamp,
ControlTransaction.StoreID,
ControlTransaction.WorkstationID,
ControlTransaction.Transaction.TransactionID,
ControlTransaction.TransactionNumber,
ControlType,
RetailTransaction.StoreID,
RetailTransaction.WorkstationID
]

Solution

  • Just a few changes:

    1. Replace the parent=None parameter with parents=[] so that the function receives the complete list of parent names.
    2. If a column contains nested "columns":
      • Add its name to a copy of the parent list.
      • Obtain the value paths of the nested items using recursion.
      • Add these values to the resulting names list.
    3. If a column doesn't contain nested "columns": append its name to the parents and join the resulting list with a . separator.
    4. Return names.
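
    import yaml

    def test(config, parents=[]):
        # parents holds the names of all enclosing columns. It is never
        # mutated in this function, so the shared default list is safe here.
        names = []
        for column in config:
            if column.get("dtype") in ["array", "struct"] and "columns" in column:
                # Nested column: recurse with this name appended to a copy.
                cur_parents = parents.copy()
                cur_parents.append(column["name"])
                children = test(column["columns"], cur_parents)
                names.extend(children)
            else:
                # Leaf column: join the parent names and its own name with dots.
                value = column["name"]
                value_path = parents + [value]
                names.append(".".join(value_path))

        return names

    with open("input.yaml", "r") as inp:
        yaml_conf = yaml.safe_load(inp)

    values = test(yaml_conf.get("columns"))

    # Print one path per line, wrapped in brackets as in the desired output.
    print("[\n{}\n]".format(",\n".join(values)))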

    Edit: make sure to check the important notes made by @Anthon in the comments and in another answer.
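
As a side note, a mutable default such as parents=[] is shared between calls in Python, which is harmless in the function above only because the list is never modified in place. The sketch below is one possible variant that avoids the question altogether by using an immutable tuple and a generator. The name flatten_columns and the shortened config dictionary are hypothetical, chosen for illustration. The dictionary stands in for what yaml.safe_load would return for a trimmed version of the input, so the example runs without a YAML file.

```python
def flatten_columns(columns, parents=()):
    """Yield a dot-separated path for every leaf column."""
    for column in columns:
        # Build a new tuple for each column; the parents tuple itself is immutable.
        path = parents + (column["name"],)
        if column.get("dtype") in ("array", "struct") and "columns" in column:
            # Nested column: recurse, passing the extended path down.
            yield from flatten_columns(column["columns"], path)
        else:
            # Leaf column: emit the full path joined with dots.
            yield ".".join(path)


# Hypothetical stand-in for the parsed YAML (a shortened version of the input).
config = {
    "columns": [
        {"name": "PartitionDate"},
        {
            "name": "ControlTransaction",
            "dtype": "struct",
            "columns": [
                {"name": "StoreID"},
                {
                    "name": "Transaction",
                    "dtype": "struct",
                    "columns": [{"name": "TransactionID"}],
                },
                {"name": "TransactionNumber"},
            ],
        },
        {"name": "ControlType"},
    ]
}

paths = list(flatten_columns(config["columns"]))
print("[\n{}\n]".format(",\n".join(paths)))
```

Because the function is a generator, the caller decides whether to collect the paths into a list, as here, or iterate over them lazily.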