Search code examples
arraysjsonmultidimensional-arraydeserializationn-triples

Unflatten JSON objects with indices / deserialize n-triples to hierarchical Excel


I have parsed JSON+LD (structured) data from the tool Screaming Frog. The format this tool exports the data in is unworkable as the parent/child relationship (cross-reference) is not on one row in Excel. Edit: this serialized format is called n-triples. Below is an example output with the index relationships colour-coded (sorry not allowed to post images yet):

https://i.sstatic.net/7Zzp3.jpg

Subject     Predicate                           Object
subject27   schema.org/aggregateRating          subject28
subject27   schema.org/offers                   subject29
subject27   schema.org/operatingSystem          ANDROID
subject27   type                                schema.org/SoftwareApplication
subject28   schema.org/ratingCount              15559
subject28   schema.org/ratingValue              3.597853422
subject28   type                                schema.org/AggregateRating
subject29   schema.org/price                    0
subject29   type                                schema.org/Offer

Below would be an example of the final output required where all nested levels are in it's own column. Each of the nested levels (up to 4 deep) should be mapped into its own column, repeating the parent path information.

Predicate L1                Object L1                       Predicate L2            Object L2
type                        schema.org/SoftwareApplication      
schema.org/operatingSystem  ANDROID     
schema.org/aggregateRating  subject28                       schema.org/ratingCount  15559
schema.org/aggregateRating  subject28                       schema.org/ratingValue  3.597853422
schema.org/aggregateRating  subject28                       type                    schema.org/AggregateRating
schema.org/offers           subject29                       schema.org/price        0
schema.org/offers           subject29                       type                    schema.org/Offer

I have looked for existing unflatten solutions, but these either use the path information stored into a single column (with each "lowest level value" having its own "row") or don't rebuilt the original data based upon indices.

I am looking to do this with a combination of for loops with SQL JOINS, but I feel there must be a more elegant solution. This could be in Python, PHP, JS or SQL or a combination or even adding each "subject" into a MongoDB document and then applying a merge operation on this?

Edit: Updating the title to optimize SEO for this article. The serialized format of this RDF and JSON+LD data I am working with is called N-triples. Read more here: https://medium.com/wallscope/understanding-linked-data-formats-rdf-xml-vs-turtle-vs-n-triples-eb931dbe9827


Solution

  • This is probably all kinds of ugly and very surely un-pythonic in more ways than one, but it gets the job done on your sample data:

    import re
    
    def group_items(items, prop):
        group = {}
        for item in items:
            key = item[prop]
            if key not in group:
                group[key] = []
            group[key].append(item)
        return group
    
    with open('input.txt', encoding='utf8') as f:
        # analyze column widths on the example of the header row
        # this allows for flexible column withds in the input data
        header_row = next(f)
        columns = re.findall('\S+\s*', header_row.rstrip('\n'))
        i = 0
        cols = []
        headers = []
        for c in columns:
            headers.append( c.strip() )
            cols.append( [i, i + len(c)] )
            i += len(c)
        cols[-1][1] = 100000   # generous data length for last column
    
        # extract one item per line, using those column widths
        items = []
        for line in f:
            item = {}
            for c, col in enumerate(cols):
                item[headers[c]] = line[col[0]:col[1]].strip()
            items.append(item)
    
    # group items to figure out which ones are at the root
    items_by_subject = group_items(items, 'Subject')
    items_by_object = group_items(items, 'Object')
    
    # root keys are those that are not anyone else's subject
    root_keys = set(items_by_subject.keys()) - set(items_by_object.keys())
    root_items = [items_by_subject[k] for k in root_keys]
    
    # recursive function to walk the tree and determine the leafs
    leafs = []
    def unflatten(items, parent=None, level=1):
        for item in items:
            item['Parent'] = parent
            item['Level'] = level
            key = item['Object']
            if key in items_by_subject:
                unflatten(items_by_subject[key], item, level+1)
            else:
                leafs.append(item)
    
    # ...which needs to be called for each group of root items
    for group in root_items:
        unflatten(group)
    
    # this is not limited to 4 levels
    max_level = max(item['Level'] for item in leafs)
    
    # recursive function to fill in parent data
    def fill_data(item, output={}):
        parent = item['Parent']
        if parent is not None:
            fill_data(parent, output)
        output['Predicate L%s' % item['Level']] = item['Predicate']
        output['Object L%s' % item['Level']] = item['Object']
    
    # ...which needs to be called once per leaf
    result = []
    for leaf in reversed(leafs):
        output = {}
        for l in range(1, max_level + 1):
            output['Predicate L%s' % l] = None
            output['Object L%s' % l] = None
        fill_data(leaf, output)
        result.append(output)
    
    # output result
    for item in result:
        print(item)
    

    Given your sample input as input.txt, the output is the following me:

    {'Predicate L1': 'type', 'Object L1': 'schema.org/SoftwareApplication', 'Predicate L2': None, 'Object L2': None}
    {'Predicate L1': 'schema.org/operatingSystem', 'Object L1': 'ANDROID', 'Predicate L2': None, 'Object L2': None}
    {'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29', 'Predicate L2': 'type', 'Object L2': 'schema.org/Offer'}
    {'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29', 'Predicate L2': 'schema.org/price', 'Object L2': '0'}
    {'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'type', 'Object L2': 'schema.org/AggregateRating'}
    {'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'schema.org/ratingValue', 'Object L2': '3.597853422'}
    {'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'schema.org/ratingCount', 'Object L2': '15559'}
    

    I'll leave putting this into some sort of file as an exercise.