Search code examples
pythondata-structurestreehierarchical-trees

Creating hierarchy tree from dictionary of pages' contents


The following key:value pairs are 'page' and 'page contents'.

{
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

For any given 'item' how could I find the path(s) to said item? With my very limited knowledge of data structures in most cases, I'm assuming this would be a hierarchy tree. Please correct me if I'm wrong!

UPDATE: My apologies, I should have been more clear about the data and my expected outcome.

Assuming 'page-a' is an index, each 'page' is literally a page appearing on a website, where as each 'item' is something like a product page that would appear on Amazon, Newegg, etc.

Thus, my expected output for 'item-d' would be a path (or paths) to that item. For example (delimiter is arbitrary, for illustration here): item-d has the following paths:

page-a > page-b > page-e > item-d
page-a > page-c > item-d

UPDATE2: Updated my original dict to provide more accurate and real data. '.html' added for clarification.


Solution

  • Here's a simple approach -- it's O(N squared), so, not all that highly scalable, but will serve you well for a reasonable book size (if you have, say, millions of pages, you need to be thinking about a very different and less simple approach;-).

    First, make a more usable dict, mapping page to set of contents: e.g., if the original dict is d, make another dict mud as:

    mud = dict((p, set(d[p]['contents'].split())) for p in d)
    

    Then, make the dict mapping each page to its parent pages:

    parent = dict((p, [k for k in mud if p in mud[k]]) for p in mud)
    

    Here, I'm using lists of parent pages (sets would be fine too), but that's OK for pages with 0 or 1 parents as in your example, too -- you'll just be using an empty list to mean "no parent", else a list with the parent as the one and only item. This should be an acyclic directed graph (if you're in doubt, you can check, of course, but I'm skipping that check).

    Now, given a page, finding the paths up its parent(s) to a parentless-parent ("root page") simply require "walking" the parent dict. E.g., in the 0/1 parent case:

    path = [page]
    while parent[path[-1]]:
      path.append(parent[path[-1]][0])
    

    If you can clarify your specs better (ranges of number of pages per book, number of parents per page, and so on), this code can no doubt be refined, but as a start I hope it can help.

    Edit: as the OP clarified that cases with > 1 parent (and so, multiple paths) are indeed of interest, let me show how do deal with that:

    partial_paths = [ [page] ]
    while partial_paths:
      path = partial_paths.pop()
      if parent[path[-1]]:
        # add as many partial paths as open from here
        for p in parent[path[-1]]:
          partial_paths.append(path + [p])
      else:
        # we've reached a root (parentless node)
        print(path)
    

    Of course, instead of printing, you can yield each path when it reaches a root (making the function whose body this is into a generator), or otherwise treat it in whatever way you require.

    Edit again: a commenter is worried about cycles in the graph. If that worry's warranted, it's not hard to keep track of nodes already seen in a path and detect and warn about any cycles. Fastest is to keep a set alongside each list representing a partial path (we need the list for ordering, but checking for membership is O(1) in sets vs O(N) in lists):

    partial_paths = [ ([page], set([page])) ]
    while partial_paths:
      path, pset = partial_paths.pop()
      if parent[path[-1]]:
        # add as many partial paths as open from here
        for p in parent[path[-1]]:
          if p in pset:
            print('Cycle: %s (%s)' % (path, p))
            continue
          partial_paths.append((path + [p], pset.union([p])))
      else:
        # we've reached a root (parentless node)
        print('Path: %s' % (path,))
    

    It's probably worthwhile, for clarity, packing the list and set representing a partial path into a small utility class Path with suitable methods.