python pandas dataframe yaml denormalization

How to denormalize YAML for a Pandas DataFrame?


I am trying to get data from a YAML file into a Pandas DataFrame. Take the following example data.yml:

---
 - doc: "Book1"
   reviews:
     - reviewer: "Paul"
       stars: "5"
     - reviewer: "Sam"
       stars: "2"
 - doc: "Book2"
   reviews:
     - reviewer: "John"
       stars: "4"
     - reviewer: "Sam"
       stars: "3"
     - reviewer: "Pete"
       stars: "2"
...

The desired DataFrame would look like this:

     doc reviews.reviewer reviews.stars
0  Book1             Paul             5
1  Book1              Sam             2
2  Book2             John             4
3  Book2              Sam             3
4  Book2             Pete             2

I've tried feeding the YAML data to Pandas in different ways (for example, with open('data.yml') as f: data = pd.DataFrame(yaml.load(f))), but the cells always contain the nested dicts. This solution works for general JSON data, but it's quite a bit of code, and it seems like a simpler solution for YAML might exist.

Is there a built-in way to denormalize YAML for conversion to a Pandas DataFrame in this way?


Solution

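  • Use json_normalize to flatten the nested dictionaries once the YAML is loaded, with f being the open data.yml file. Pass 'reviews' as the record path and 'doc' as the metadata column to repeat on every row. In pandas 1.0 and later the function is available as pd.json_normalize (older versions expose it as pd.io.json.json_normalize), and yaml.safe_load avoids the Loader argument that newer PyYAML versions require for yaml.load:

    pd.json_normalize(yaml.safe_load(f), 'reviews', 'doc')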
    
      reviewer stars    doc
    0     Paul     5  Book1
    1      Sam     2  Book1
    2     John     4  Book2
    3      Sam     3  Book2
    4     Pete     2  Book2
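
    To get the exact column names and order asked for in the question, the json_normalize approach can be combined with a prefix for the record columns. The following is a minimal sketch, assuming pandas 1.0 or later and PyYAML are installed; the YAML is inlined as a string (the name YAML_TEXT is made up for this example) so the snippet runs without a data.yml file on disk:

```python
import pandas as pd
import yaml

# Same content as data.yml from the question, inlined so the sketch is
# self-contained. YAML_TEXT is a name chosen for this example only.
YAML_TEXT = """
---
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
...
"""

# safe_load returns plain Python lists and dicts.
records = yaml.safe_load(YAML_TEXT)

# One output row per entry in "reviews"; "doc" is repeated on each row.
# record_prefix adds "reviews." in front of the nested field names.
df = pd.json_normalize(
    records,
    record_path="reviews",
    meta="doc",
    record_prefix="reviews.",
)

# Put the metadata column first, as in the desired output.
df = df[["doc", "reviews.reviewer", "reviews.stars"]]
print(df)
```

    Because the stars values are quoted in the YAML, they stay strings in the DataFrame; df["reviews.stars"].astype(int) would convert them if numbers are needed.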