python-3.x yaml extract

Regex to extract a list from a yaml string


I have the contents of a YAML file as a string. I want to search it and fetch every occurrence of a list named users.

import base64

# encoded_data holds the base64-encoded YAML text
yaml_file = base64.b64decode(encoded_data).decode('utf-8')
users = []
# fetch users from yaml_file

The list can appear anywhere in the file, either at the top level or as a child at any depth, so I don't think reading the file as YAML and parsing it will be useful, since there is no single fixed structure.

...
users:
  - user1
  - user2
  - user3
...

Is there a regex that I can use to fetch only the list named users from a YAML string?


Solution

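  • Parsing structured data with a regex is almost always unreliable: YAML lets the same list be written in several ways (block or flow style, different indentation, comments), and a single pattern will not cover them all.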
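
As a rough illustration of that fragility, here is a minimal sketch using only the standard library. The pattern `NAIVE_USERS` and the helper `naive_users` are hypothetical names invented for this example, not something from the question; the pattern handles the block-style list shown in the question but silently misses the same data written in flow style.

```python
import re

# Hypothetical naive pattern: a "users:" line followed by one or more "- item" lines.
NAIVE_USERS = re.compile(
    r"^[ \t]*users:[ \t]*\n((?:[ \t]*-[ \t]+.*\n?)+)",
    re.MULTILINE,
)

def naive_users(text):
    """Return the items the naive pattern finds, or [] if nothing matches."""
    match = NAIVE_USERS.search(text)
    if match is None:
        return []
    # Drop the leading "-" and surrounding whitespace from each item line.
    return [line.strip()[1:].strip()
            for line in match.group(1).splitlines()
            if line.strip()]

block_style = "users:\n  - user1\n  - user2\n"
flow_style = "users: [user1, user2]\n"  # same data, equally valid YAML

print(naive_users(block_style))  # ['user1', 'user2']
print(naive_users(flow_style))   # [] -- the list is there, but the regex misses it
```

A YAML parser treats both spellings identically, which is why loading the document first is the more dependable route.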

    Instead, load the YAML and use a function that recursively traverses nested dicts and lists until it finds a list under a key with the specified name. This works regardless of where in the structure the key appears:

    def find_named_list(data, name):
        # Depth-first search: returns the first list stored under `name`,
        # or None (implicitly) if no such list exists.
        if isinstance(data, dict):
            for key, value in data.items():
                if key == name and isinstance(value, list):
                    return value
                if (lst := find_named_list(value, name)) is not None:
                    return lst
        elif isinstance(data, list):
            for value in data:
                if (lst := find_named_list(value, name)) is not None:
                    return lst
    

    so that:

    import yaml
    
    data = '''---
    foo:
    - hello: 1
      world:
        users:
          - user1
          - user2
          - user3
    - stack: overflow
    bar: ''
    '''
    
    print(find_named_list(yaml.safe_load(data), 'users'))
    

    outputs:

    ['user1', 'user2', 'user3']
    

    Demo: https://ideone.com/Mdjixm

    To find all sub-lists under a key with the specified name, turn the function into a generator by using yield instead of return:

    def find_named_list(data, name):
        # Generator: yields every list stored under `name`, at any depth.
        if isinstance(data, dict):
            for key, value in data.items():
                if key == name and isinstance(value, list):
                    yield value
                yield from find_named_list(value, name)
        elif isinstance(data, list):
            for value in data:
                yield from find_named_list(value, name)
    

    so that:

    data = '''---
    foo:
    - hello: 1
      world:
        users:
        - user1
        - user2
        - user3
    - stack: overflow
    bar:
      users:
      - user_a
      - user_b
      - user_c
    '''
    
    print(list(find_named_list(yaml.safe_load(data), 'users')))
    

    outputs:

    [['user1', 'user2', 'user3'], ['user_a', 'user_b', 'user_c']]
    

    Demo: https://ideone.com/eJvw4M
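
To tie this back to the question's `users = []` variable, the lists yielded by the generator can be flattened into one list. The following is a minimal sketch, not part of the original answer: `parsed` is a hand-written stand-in for what `yaml.safe_load(yaml_file)` would return for the second example, so the snippet runs without PyYAML installed, and `itertools.chain.from_iterable` is just one way to do the flattening.

```python
from itertools import chain

def find_named_list(data, name):
    # Generator from the answer above: yields every list stored under `name`.
    if isinstance(data, dict):
        for key, value in data.items():
            if key == name and isinstance(value, list):
                yield value
            yield from find_named_list(value, name)
    elif isinstance(data, list):
        for value in data:
            yield from find_named_list(value, name)

# Stand-in for yaml.safe_load(yaml_file); in real use, parse the decoded string instead.
parsed = {
    "foo": [
        {"hello": 1, "world": {"users": ["user1", "user2", "user3"]}},
        {"stack": "overflow"},
    ],
    "bar": {"users": ["user_a", "user_b", "user_c"]},
}

# Concatenate every matching list into a single flat list of users.
users = list(chain.from_iterable(find_named_list(parsed, "users")))
print(users)  # ['user1', 'user2', 'user3', 'user_a', 'user_b', 'user_c']
```

If no `users` list exists anywhere in the document, `users` simply ends up as an empty list, so no special-casing is needed.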