python json parsing text pyparsing

Pyparsing: Parsing semi-JSON nested plaintext data to a list


I have a bunch of nested data in a format that loosely resembles JSON:

company="My Company"
phone="555-5555"
people=
{
    person=
    {
        name="Bob"
        location="Seattle"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        location="Seattle"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
places=
{
    ...
}

There are many different parameters with varying levels of depth--this is just a very small subset.

It may also be worth noting that whenever a new nested block starts, the property name is always followed by an equals sign, then a line break, then the opening brace (as seen above).

Is there any simple looping or recursion technique for converting this data to a system-friendly data format such as arrays or JSON? I want to avoid hard-coding the names of properties. I am looking for something that will work in Python, Java, or PHP. Pseudo-code is fine, too.

I appreciate any help.

EDIT: I discovered the Pyparsing library for Python and it looks like it could be a big help. I can't find any examples for how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?

EDIT 2: Okay, here is a working solution in Pyparsing:

from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

def parse_file(fileName):
    # get the input text file
    with open(fileName, "r") as f:
        inputText = f.read()

    # define the elements of our data pattern
    name = Word(alphas, alphanums + "_")
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    value = Forward()  # this tells pyparsing that values can be recursive
    entry = Group(name + EQ + value)  # this is the basic name-value pair

    # define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    quotedString.setParseAction(removeQuotes)

    # declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)  # we will turn the output into a Dictionary

    # declare the types that might be contained in our data value - string, real, int, or the struct we declared
    value << (quotedString | struct | real | integer)

    # parse our input text and return it as a Dictionary
    result = Dict(OneOrMore(entry)).parseString(inputText)
    return result.dump()

This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. Also, there are \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.
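Both symptoms are consistent with result.dump() returning a formatted multi-line string rather than a dictionary: serializing a string with the json module produces a single quoted JSON string with the line breaks escaped as \n. The sketch below uses only the standard library; the dumped_text value is a made-up stand-in for the real dump() output, not actual parser output.

```python
import json

# Hypothetical stand-in for the multi-line string returned by result.dump()
dumped_text = "[['company', 'My Company']]\n- company: 'My Company'"

# A str is serialized as ONE quoted JSON string, with newlines escaped as \n
as_json_string = json.dumps(dumped_text)

# A dict is serialized as a JSON object, which is what was wanted
as_json_object = json.dumps({"company": "My Company"})

print(as_json_string)
print(as_json_object)
```

If this is the cause, the quotes and \n characters come from serializing the text report, not from line endings in the input, so suppressing LineEnd() in the grammar would not change them.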


Solution

  • Okay, I came up with a final solution that actually transforms this data into a JSON-friendly dict, as I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists, then loops through those lists and builds the dict that gets serialized to JSON. This lets me work around the problem that Pyparsing's toDict() method cannot handle an object that has two properties with the same name (such as the repeated person entries). To tell whether a list is a plain list or a property/value pair, the prependPropertyToken method adds the string __property__ in front of each property name when Pyparsing matches one.

    # module-level imports used by the methods below
    import json
    from pyparsing import (CaselessKeyword, Dict, Forward, Group, OneOrMore, Regex,
                           Suppress, Word, alphanums, dictOf, quotedString,
                           removeQuotes, replaceWith, restOfLine)

    # the three functions here are methods of one parser class (class statement not shown)
    def parse_file(self, fileName):
        # get the input text file
        with open(fileName, "r") as f:
            inputText = f.read()

        # define data types that might be in the values
        real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
        integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
        yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
        no = CaselessKeyword("no").setParseAction(replaceWith(False))
        quotedString.setParseAction(removeQuotes)
        unquotedString = Word(alphanums + "_-?\"")
        comment = Suppress("#") + Suppress(restOfLine)
        EQ, LBRACE, RBRACE = map(Suppress, "={}")

        data = (real | integer | yes | no | quotedString | unquotedString)

        # define structures (names chosen to avoid shadowing the object/property builtins)
        value = Forward()
        obj = Forward()

        dataList = Group(OneOrMore(data))
        simpleArray = (LBRACE + dataList + RBRACE)

        # tag every property name so convert_to_dict can recognize name/value pairs
        propertyName = Word(alphanums + "_-.").setParseAction(self.prependPropertyToken)
        prop = dictOf(propertyName + EQ, value)
        properties = Dict(prop)

        obj << (LBRACE + properties + RBRACE)
        value << (data | obj | simpleArray)

        dataset = properties.ignore(comment)

        # parse it
        result = dataset.parseString(inputText)

        # turn it into a JSON-like object
        converted = self.convert_to_dict(result.asList())
        return json.dumps(converted)
    
    
    
    def convert_to_dict(self, inputList):
        result = {}
        for item in inputList:
            # determine the key and value to be inserted into the dict
            dictval = None
            key = None

            if isinstance(item, list):
                try:
                    key = item[0].replace("__property__", "")
                    if isinstance(item[1], list):
                        try:
                            if item[1][0].startswith("__property__"):
                                # nested object: recurse over its property/value pairs
                                dictval = self.convert_to_dict(item)
                            else:
                                dictval = item[1]
                        except AttributeError:
                            # first element is not a string, so this is a plain list of values
                            dictval = item[1]
                    else:
                        dictval = item[1]
                except IndexError:
                    dictval = None
            # determine whether to insert the value into the key or to merge the value with existing values at this key
            if key:
                if key in result:
                    if isinstance(result[key], list):
                        result[key].append(dictval)
                    else:
                        # second occurrence of this key: turn the single value into a list
                        result[key] = [result[key], dictval]
                else:
                    result[key] = dictval
        return result
    
    
    
    def prependPropertyToken(self, t):
        return "__property__" + t[0]
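To see the duplicate-key merging on its own, here is a minimal standalone sketch that needs no Pyparsing. The function name pairs_to_dict and the hand-written sample list are assumptions for illustration: the sample only mimics the shape I would expect asList() to return for the data in the question, and the function is a simplified variant of convert_to_dict rather than a drop-in replacement.

```python
PROPERTY_TOKEN = "__property__"


def is_property(token):
    """True if token is a property name tagged with the marker string."""
    return isinstance(token, str) and token.startswith(PROPERTY_TOKEN)


def pairs_to_dict(items):
    """Turn [[tagged_name, value, ...], ...] into a dict, merging repeated names into lists."""
    out = {}
    for item in items:
        # skip anything that is not a [tagged_name, ...] group
        if not isinstance(item, list) or not item or not is_property(item[0]):
            continue
        key = item[0][len(PROPERTY_TOKEN):]
        rest = item[1:]
        if rest and isinstance(rest[0], list) and rest[0] and is_property(rest[0][0]):
            value = pairs_to_dict(rest)  # nested object: recurse over its pairs
        else:
            value = rest[0] if rest else None  # scalar or plain list of values
        if key in out:
            if isinstance(out[key], list):
                out[key].append(value)
            else:
                out[key] = [out[key], value]  # second occurrence: start a list
        else:
            out[key] = value
    return out


# Hand-written sample (an assumption, not real parser output)
sample = [
    ["__property__company", "My Company"],
    ["__property__people",
        ["__property__person", ["__property__name", "Bob"], ["__property__size", 1]],
        ["__property__person", ["__property__name", "Joe"], ["__property__size", 2]]],
]

converted = pairs_to_dict(sample)
print(converted)
```

Like the original, this appends to any value that is already a list, so a repeated key whose first value is a plain array gets the later values added to that array instead of being wrapped in a new list.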