I am currently trying to automate the generation of a document using HTML files that exist locally on my machine. Each HTML document is named for the object it describes, and I am only interested in getting the name of each objects properties and the data type of each property, and preserving the hierarchical relationships between certain objects.
I have the following code so far:
import os
from lxml import html
fileList = []
for folderName, subFolders, filenames in os.walk("Path/To/Relevant/Files"):
for filename in filenames:
fileList.append(folderName + "/" + filename)
propertyDictList = []
for i in range(0, len(fileList)):
file = open(fileList[i])
page = file.read()
tree = html.fromstring(page)
propertyNameXpath = tree.xpath("//someXpathquery")
propertyNames = [str(i) for i in propertyNameXpath]
propertyTypeXpath = tree.xpath("//anotherXpathquery")
propertyTypes = [str(i) for i in propertyTypeXpath]
propertyDict = dict(zip(propertyNames, propertyTypes))
propertyDictList.append(propertyDict)
This code gets the names and data types of every property from every file in the directory, and puts them into key-value pairs as the entries of dictionaries, one dictionary for each file. These dictionaries are then appended to propertyDictList
.
What am I trying to figure out now is how I can reestablish the hierarchical relationships between certain objects. Say, for example, that I have a file that describes an object "foo." Let's call the filename Path/To/Relevant/Files/foo.html
. Now, this "foo" object may have several properties, such that the dictionary describing it looks as follows:
{"bar" : "string", "baz" : "int", "fizz" : "buzz"}
The "buzz"
data type actually refers to another object that exists in the directory, described in Path/To/Relevant/Files/buzz.html
. What I would like to do is compare the values of my dictionaries against the list of filenames in the directory, and in the event there is a match between some dictionary value and an item in the filename list, the dictionary extracted from the matching file is put in place of the value. e.g.
{"bar" : "string", "baz" : "int", "fizz" : { "baa" : "ram" , "ewe" : "fleece" }}
In your current code you don't store the mapping from the file names to the properties extracted from the file. Assuming you add that, the expansion you're talking about is relatively straightforward:
props_by_file = {
"foo": {"bar" : "string", "baz" : "int", "fizz" : "buzz"},
"buzz": { "baa" : "ram" , "ewe" : "fleece" }
}
for file_props in props_by_file.values():
for k, v in file_props.items():
if v in props_by_file:
file_props[k] = props_by_file[v]
props_by_file
# {'foo': {'bar': 'string', 'baz': 'int', 'fizz': {'baa': 'ram', 'ewe': 'fleece'}},
# 'buzz': {'baa': 'ram', 'ewe': 'fleece'}
# }