Tags: python, pandas, pyyaml

Missing first document when loading multi-document yaml file in pandas dataframe


I tried to load a multi-document YAML file (i.e., a YAML file consisting of multiple YAML documents separated by "---") into a Pandas dataframe. For some reason, the first document does not end up in the dataframe. If the output of yaml.safe_load_all is first materialized into a list (instead of feeding the iterator to pd.io.json.json_normalize), all documents end up in the dataframe. I could reproduce this with the example code below (and also with an entirely different YAML file).

import os
import yaml
import pandas as pd
import urllib.request

# public example of multi-document yaml
inputfilepath = os.path.expanduser("~/my_example.yaml")
url = "https://raw.githubusercontent.com/kubernetes/examples/master/guestbook/all-in-one/guestbook-all-in-one.yaml"
urllib.request.urlretrieve(url, inputfilepath)

with open(inputfilepath, 'r') as stream:
    df1 = pd.io.json.json_normalize(yaml.safe_load_all(stream))

with open(inputfilepath, 'r') as stream:
    df2 = pd.io.json.json_normalize([x for x in yaml.safe_load_all(stream)])

print(f'Output table shape with iterator: {df1.shape}')
print(f'Output table shape with iterator materialized as list: {df2.shape}')

I expect both results to be identical, but I get:

Output table shape with iterator: (5, 18)
Output table shape with iterator materialized as list: (6, 18)

Any ideas why these results differ?


Solution

  • See this site for list comprehension vs. generator expressions.

    df1 is missing the first row of data because you are passing a generator (a one-shot iterator) instead of a list.

    print(yaml.safe_load_all(stream))
    # Output: <generator object load_all at 0x00000293E1697750>
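The key property of such a generator is that it can be traversed only once: anything pulled off it is gone for good, whereas a list (e.g. from a list comprehension) can be re-read freely. A minimal illustration:

```python
gen = (i for i in range(3))
first = next(gen)    # consuming an element advances the generator
rest = list(gen)     # the consumed element is no longer there
print(first, rest)   # 0 [1, 2]

lst = [i for i in range(3)]
print(lst[0], lst)   # 0 [0, 1, 2] -- a list can be traversed again
```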
    

    From the pandas docs, it is expecting a list:

    data : dict or list of dicts
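So the simplest fix is to materialize the generator with list() before handing it to pandas. A sketch of that fix, using pd.json_normalize (the current name of the function) and a hypothetical fake_load_all standing in for yaml.safe_load_all so the snippet needs no YAML file:

```python
import pandas as pd

def fake_load_all():
    # hypothetical stand-in for yaml.safe_load_all:
    # yields one dict per YAML document, with made-up data
    yield {"kind": "Service", "metadata": {"name": "redis-master"}}
    yield {"kind": "Deployment", "metadata": {"name": "frontend"}}

# materialize the one-shot generator into a list first
docs = list(fake_load_all())
df = pd.json_normalize(docs)
print(df.shape)  # (2, 2) -- both documents, columns 'kind' and 'metadata.name'
```

Note that newer pandas versions also convert iterable input to a list internally, so whether the generator itself misbehaves depends on your pandas version; materializing explicitly is safe either way.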

    Update for more detail:

    Looking into the normalize.py source file, the function json_normalize has this conditional check, which decides whether your records contain nested structures:

    if any([isinstance(x, dict)
            for x in compat.itervalues(y)] for y in data):
        # naive normalization, this is idempotent for flat records
        # and potentially will inflate the data considerably for
        # deeply nested structures:
        #  {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:@}
        #
        # TODO: handle record value which are lists, at least error
        #       reasonably
        data = nested_to_record(data, sep=sep)
    return DataFrame(data)
    

    Inside the nested_to_record function:

    new_d = copy.deepcopy(d)
    for k, v in d.items():
        # each key gets renamed with prefix
        if not isinstance(k, compat.string_types):
            k = str(k)
        if level == 0:
            newkey = k
        else:
            newkey = prefix + sep + k
    
        # only dicts gets recurse-flattend
        # only at level>1 do we rename the rest of the keys
        if not isinstance(v, dict):
            if level != 0:  # so we skip copying for top level, common case
                v = new_d.pop(k)
                new_d[newkey] = v
            continue
        else:
            v = new_d.pop(k)
            new_d.update(nested_to_record(v, newkey, sep, level + 1))
    new_ds.append(new_d)
    

    The any(...) check shown above is where your generator gets partially consumed: any() is fed a generator expression over your data and short-circuits as soon as it finds a record containing a dict value. Evaluating that condition pulls the first document off your generator, so when nested_to_record later iterates over what remains of data, the first document is already gone. A list does not have this problem because it can be iterated any number of times, which is why materializing the generator first gives you all six rows.
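The consumption by any() can be reproduced without pandas or PyYAML. A simplified, self-contained version of that check, with made-up records standing in for the parsed YAML documents:

```python
# three records standing in for the parsed YAML documents (made-up data)
records = iter([{"a": {"x": 1}}, {"b": {"y": 2}}, {"c": {"z": 3}}])

# simplified version of the check in json_normalize: any() stops as soon
# as it sees a record with a dict value -- but by then that record has
# already been pulled off the iterator
has_nested = any(
    any(isinstance(v, dict) for v in record.values()) for record in records
)

remaining = list(records)
print(has_nested, remaining)  # True [{'b': {'y': 2}}, {'c': {'z': 3}}]
```

The first record is consumed by the check itself and never reaches the normalization step, which matches the one missing row in the question.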