Tags: python, json, python-2.7, yaml

Python: Convert multiple YAML documents to JSON


I'm trying to convert some YAML into JSON using Python, and am having a hard time getting the JSON formatted properly. My YAML file has multiple documents that look like this:

title: Windows Shell Spawning Suspicious Program
status: experimental
description: Detects a suspicious child process of a Windows shell
references:
    - https://mgreen27.github.io/posts/2018/04/02/DownloadCradle.html
author: Florian Roth
date: 20018/04/06
logsource:
    product: windows
    service: sysmon
detection:
    selection:
        EventID: 1
        ParentImage:
            - '*\mshta.exe'
            - '*\powershell.exe'
            - '*\cmd.exe'
            - '*\rundll32.exe'
            - '*\cscript.exe'
            - '*\wscript.exe'
            - '*\wmiprvse.exe'
        Image:
            - '*\schtasks.exe'
            - '*\nslookup.exe'
            - '*\certutil.exe'
            - '*\bitsadmin.exe'
            - '*\mshta.exe'
    condition: selection
fields:
    - CommandLine
    - ParentCommandLine
falsepositives:
    - Administrative scripts
level: medium
...

What I'm trying to do is, for every document, pull the detection, fields, falsepositives, and level and put those into a JSON document as individual arrays. My first attempt was pretty poor and just lumped the groups from every document into lists:

data = {}
data['indicator'] = {}
data['indicator']['detection']=[]
data['indicator']['fields']=[]
data['indicator']['false positives']=[]
data['indicator']['level']=[]
with open(yaml_file, 'r') as yaml_in, open(json_file, 'a') as definition:
     loadyaml = yaml.safe_load_all(yaml_in)
     for item in loadyaml:
         for header, subsections in item.iteritems():
             if header == 'detection':
                 data['indicator']['detection'].append(subsections)
             elif header == 'fields':
                 data['indicator']['fields'].append(subsections)
             elif header == 'falsepositives':  # the YAML key has no space
                 data['indicator']['false positives'].append(subsections)
             elif header == 'level':
                 data['indicator']['level'].append(subsections)

     json.dump(data, definition, indent=4)

I'd like each of my documents to be entered into my JSON document as an individual indicator, with its detection, fields, falsepositives, and level all grouped together, but my Python abilities are failing me.

Any insight I could get on this would be greatly appreciated!


Solution

  • You can get the output you want with a much smaller program that iterates over the documents returned by .load_all():

    import sys
    import ruamel.yaml
    import json
    
    yaml = ruamel.yaml.YAML(typ='safe')
    ind = dict()
    data = dict(indicator=ind)
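    with open('input.yaml') as fp:  # close the file once all documents are read
        for d in yaml.load_all(fp):
            for k in ('detection', 'fields', 'falsepositives', 'level'):
                # collect this key's value from every document into one list
                ind.setdefault(k, []).append(d[k])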
    
    json.dump(data, sys.stdout, indent=2)
    

    If you have a file input.yaml:

    ---
    title: Windows Shell Spawning Suspicious Program
    status: experimental
    description: Detects a suspicious child process of a Windows shell
    references:
        - https://mgreen27.github.io/posts/2018/04/02/DownloadCradle.html
    author: Florian Roth
    date: 20018/04/06
    logsource:
        product: windows
        service: sysmon
    detection:
        selection:
            EventID: 1
            ParentImage:
                - '*\mshta.exe'
                - '*\powershell.exe'
                - '*\cmd.exe'
                - '*\rundll32.exe'
                - '*\cscript.exe'
                - '*\wscript.exe'
                - '*\wmiprvse.exe'
            Image:
                - '*\schtasks.exe'
                - '*\nslookup.exe'
                - '*\certutil.exe'
                - '*\bitsadmin.exe'
                - '*\mshta.exe'
        condition: selection
    fields:
        - CommandLine
        - ParentCommandLine
    falsepositives:
        - Administrative scripts
    level: medium
    ...
    ---
    title: Bash starting just what is asked
    status: stabel
    description: No negative side effects
    references:
        - https://nblue24.github.io/posts/2019/04/01/DownloadBed.html
    author: Axel Roth
    date: 2019/04/01
    logsource:
        product: linux
        service: good
    detection:
        selection:
            EventID: 42
            ParentImage:
                - '*/bash'
                - '*/ash'
            Image:
                - systemctl
                - init
        condition: selection
    fields:
        - Shell
        - ParentShell
    falsepositives:
        - root programs
    level: high
    ...
    

    Your output will be:

    {
      "indicator": {
        "detection": [
          {
            "selection": {
              "EventID": 1,
              "ParentImage": [
                "*\\mshta.exe",
                "*\\powershell.exe",
                "*\\cmd.exe",
                "*\\rundll32.exe",
                "*\\cscript.exe",
                "*\\wscript.exe",
                "*\\wmiprvse.exe"
              ],
              "Image": [
                "*\\schtasks.exe",
                "*\\nslookup.exe",
                "*\\certutil.exe",
                "*\\bitsadmin.exe",
                "*\\mshta.exe"
              ]
            },
            "condition": "selection"
          },
          {
            "selection": {
              "EventID": 42,
              "ParentImage": [
                "*/bash",
                "*/ash"
              ],
              "Image": [
                "systemctl",
                "init"
              ]
            },
            "condition": "selection"
          }
        ],
        "fields": [
          [
            "CommandLine",
            "ParentCommandLine"
          ],
          [
            "Shell",
            "ParentShell"
          ]
        ],
        "falsepositives": [
          [
            "Administrative scripts"
          ],
          [
            "root programs"
          ]
        ],
        "level": [
          "medium",
          "high"
        ]
      }
    }
    

    This works on both Python 2 and 3.
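
If the goal is instead one entry per YAML document, with that document's detection, fields, falsepositives, and level kept together, the same loop can build a list of dictionaries. The following is only a sketch under assumptions: the helper names `group_indicators` and `convert` and the top-level `indicators` key are made up for illustration, and a key missing from a document is stored as `None` rather than raising `KeyError`.

```python
import json


def group_indicators(documents,
                     keys=('detection', 'fields', 'falsepositives', 'level')):
    """Build one indicator entry per document instead of one list per key.

    `documents` is any iterable of dicts, e.g. the result of load_all().
    The function name and the 'indicators' key are hypothetical.
    """
    indicators = []
    for doc in documents:
        # doc.get() returns None for a key the document lacks
        indicators.append({key: doc.get(key) for key in keys})
    return {'indicators': indicators}


def convert(yaml_path):
    """Load every document in yaml_path and return the grouped JSON text."""
    import ruamel.yaml  # imported here so group_indicators() works without it

    yaml = ruamel.yaml.YAML(typ='safe')
    with open(yaml_path) as fp:
        # load_all() is lazy, so consume it while the file is still open
        return json.dumps(group_indicators(yaml.load_all(fp)), indent=2)
```

Calling `print(convert('input.yaml'))` on the two-document file above should then emit a JSON object whose `indicators` list holds two objects, one per rule, each carrying only the four selected keys.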