python python-3.x python-2.7 yaml pyyaml

How do I write a yaml file from a dictionary with python?

I have a CSV file in which the header row contains keys and the cells contain values. I would like to use Python to create a YAML file from the contents of the CSV file.

I created a dictionary of the K:V pairs; however, I am stuck trying to get the K:V pairs into the yaml file.

The structure of the yaml must be:

key1: value1
key2: value2
key3:
  -  key4: value4
     key5: {key6: [value6]
key7: value7
key8: value8
key9: value9
  -
  -
---

If I were to manually create these, I would have more than 1000 YAMLs so it's pretty time consuming and unrealistic.

I am looking for any ideas you much more experienced people might have.

I would really like to iterate through the dictionary and create one long listing of YAML documents like the one below:

key1: value1
key2: value2
key3:
  -  key4: value4
     key5: {key6: [value6]
key7: value7
key8: value8
key9: value9
  -
  -
---
key1: value1
key2: value2
key3:
  -  key4: value4
     key5: {key6: [value6]
key7: value7
key8: value8
key9: value9
  -
  -
---
key1: value1
key2: value2
key3:
  -  key4: value4
     key5: {key6: [value6]
key7: value7
key8: value8
key9: value9
  -
  -
---
key1: value1
key2: value2
key3:
  -  key4: value4
     key5: {key6: [value6]
key7: value7
key8: value8
key9: value9
  -
  -
---

Sample Code:

import csv
import sys
import yaml

def csv_dict_list(variables_file) :

    reader=csv.DictReader(open(variables_file, 'r'))
    dict_list = []
    for line in reader:
        dict_list.append(line)
    return dict_list

yaml_values = csv_dict_list(sys.argv[1])

No matter what I try after this, I cannot get the desired output using yaml.load() or yaml.load_all().

Solution

First of all, you should use dump() or dump_all(), since you want to write YAML, instead of using load().
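
As a minimal sketch of that point, assuming PyYAML is installed and using two made-up rows in place of the real CSV data, safe_dump_all() writes one YAML document per dictionary, with --- between them:

```python
import yaml

# Made-up rows standing in for what csv_dict_list() returns.
rows = [
    {"key1": "value_a1", "key2": "value_a2"},
    {"key1": "value_b1", "key2": "value_b2"},
]

# default_flow_style=False asks for block style (one "key: value" per line).
text = yaml.safe_dump_all(rows, default_flow_style=False)
print(text)
```

Passing an open file as the second argument writes to that file instead of returning a string. This covers only flat key/value pairs.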

You should also be aware that the CSV reader returns something different on Python 2.7 than on, e.g., Python 3.6: on the former you get a list of dict back from csv_dict_list, and on the latter a list of OrderedDict. That in itself would not be a problem, but PyYAML dumps a dict with the keys sorted, and an OrderedDict with a tag.
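
If the row type matters, one option is to convert every row to a plain dict before dumping. The sketch below assumes Python 3 and reads an in-memory CSV with made-up values; rows_as_dicts is a hypothetical helper name, not part of the csv module.

```python
import csv
import io

def rows_as_dicts(stream):
    """Return each CSV row as a plain dict, whatever mapping type DictReader yields."""
    return [dict(row) for row in csv.DictReader(stream)]

# In-memory stand-in for the real CSV file.
sample = io.StringIO("key1,key2\nvalue_a1,value_a2\nvalue_b1,value_b2\n")
rows = rows_as_dicts(sample)
print(type(rows[0]).__name__)
print(rows)
```

This avoids the OrderedDict tag in the dump, although PyYAML would still sort the keys of a plain dict.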

Your proposed YAML is also not valid, as the flow style mapping in the line:

 key5: {key6: [value6]

is not terminated with a } before the end of the document. You also cannot have:

key9: value9
  -
  -

Either use:

key9: value9
key10:
  -
  -

or:

key9:
  - value9
  -

or something similar (there is no equivalent Python data structure that has both a value and a list for one and the same key, so you cannot create something like that even in Python).
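
In Python terms, the two valid alternatives above correspond to the structures sketched below. This assumes the empty - entries load as None, which is how YAML treats an empty sequence item; the variable names are only illustrative.

```python
# Alternative 1: the scalar and the list live under separate keys.
separate_keys = {"key9": "value9", "key10": [None, None]}

# Alternative 2: the scalar becomes the first item of the list.
single_list = {"key9": ["value9", None]}

# A dict key maps to exactly one object, so assigning a list to key9
# replaces the string instead of sitting next to it.
d = {"key9": "value9"}
d["key9"] = [None, None]
print(d)
```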

PyYAML additionally lacks support for indenting block style sequences. If you do:

import yaml
print(yaml.dump(dict(x=[dict(a=1, b=2)]), indent=4))

the output will still be flush left:

x:
- {a: 1, b: 2}

To prevent all these problems you will run into when using PyYAML, and to circumvent the differences in Python versions, I recommend you use ruamel.yaml (disclaimer: I am the author of that package), and the following code:

import sys
import csv
import ruamel.yaml

Dict = ruamel.yaml.comments.CommentedMap

def csv_dict_list(variables_file):
    dict_list = []
    with open(variables_file, 'r') as f:
        key_list = None
        for line in csv.reader(f):
            # the first row holds the keys
            if key_list is None:
                key_list = line
                continue
            d = Dict()
            for idx, v in enumerate(line):
                k = key_list[idx]
                # special handling of key3/key4/key5/key6
                if k == key_list[2]:
                    # key3 holds a sequence; its own cell value is dropped
                    d[k] = []
                elif k == key_list[3]:
                    # key4 starts the single mapping inside that sequence
                    d[key_list[2]].append(Dict([(k, v)]))
                elif k == key_list[4]:
                    # key5 is a nested mapping, dumped in flow style: {...}
                    d[key_list[2]][0][k] = dt = Dict()
                    dt.fa.set_flow_style()
                elif k == key_list[5]:
                    # key6 goes inside key5, with its value wrapped in a list
                    d[key_list[2]][0][key_list[4]][k] = [v]
                else:
                    d[k] = v
            dict_list.append(d)
    return dict_list

data = csv_dict_list('test.csv')


yaml = ruamel.yaml.YAML()
yaml.indent(sequence=4, offset=2)
yaml.dump_all(data, sys.stdout)

With test.csv:

key1,key2,key3,key4,key5,key6,key7,key8,key9
value_a1,value_a2,value_a3,value_a4,value_a5,value_a6,value_a7,value_a8,value_a9
value_b1,value_b2,value_b3,value_b4,value_b5,value_b6,value_b7,value_b8,value_b9

this gives:

key1: value_a1
key2: value_a2
key3:
  - key4: value_a4
    key5: {key6: [value_a6]}
key7: value_a7
key8: value_a8
key9: value_a9
---
key1: value_b1
key2: value_b2
key3:
  - key4: value_b4
    key5: {key6: [value_b6]}
key7: value_b7
key8: value_b8
key9: value_b9

on both Python 2.7 and Python 3.6.