Search code examples
c#clojures-expression

S-Expressions parsing


I ran into this question earlier today:

Example Input: I ran into Joe and Jill and then we went shopping
Example Output: [TOP [S [S [NP [PRP I]] [VP [VBD ran] [PP [IN into] [NP [NNP Joe] [CC and] [NNP Jill]]]]] [CC and] [S [ADVP [RB then]] [NP [PRP we]] [VP [VBD went] [NP [NN shopping]]]]]]

enter image description here

I was about to suggest simply parsing the expected output (as it looks like an s-expression) into an object (in our case a tree) and then using simple LINQ methods to process it. However, to my surprise, I was unable to find a C# s-expression parser.

The only thing I could think of is using Clojure to parse it since it compiles to the clr, I'm not sure it's a good solution though.

By the way, I don't mind the answer to output of type dynamic. Only answers I've found here were for deserializing into a specific schema.

To sum up my question: I need to deserialize s-expressions in C# (serialization would be nice for future readers of this question)


Solution

  • It looks like you need a data-structure of the form:

    public class SNode
    {
        public String Name { get; set; }
    
        private readonly List<SNode> _Nodes = new List<SNode>();
        public ICollection<SNode> Nodes { get { return _Nodes; } }
    }
    

    A serializer of the form

    public String Serialize(SNode root)
    {
        var sb = new StringBuilder();
        Serialize(root, sb);
        return sb.ToString();
    }
    
    private void Serialize(SNode node, StringBuilder sb)
    {
        sb.Append('(');
    
        sb.Append(node.Name);
    
        foreach (var item in node.Nodes)
            Serialize(item, sb);
    
        sb.Append(" )");
    }
    

    And a de-serializer of the form:

    public SNode Deserialize(String st)
    {
        if (String.IsNullOrWhiteSpace(st))
            return null;
    
        var node = new SNode();
    
        var nodesPos = String.IndexOf('(');
        var endPos = String.LastIndexOf(')');
    
        var childrenString = st.SubString(nodesPos, endPos - nodesPos);
    
        node.Name = st.SubString(1, (nodesPos >= 0 ? nodePos : endPos)).TrimEnd();
    
        var childStrings = new List<string>();
    
        int brackets = 0;
        int startPos = nodesPos;
        for (int pos = nodesPos; pos++; pos < endPos)
        {
            if (st[pos] == '(')
                brackets++;
            else if (st[pos] == ')')
            {
                brackets--;
    
                if (brackets == 0)
                {
                    childStrings.Add(st.SubString(startPos, pos - startPos + 1));
                    startPos = pos + 1;
                }
            }
        }
    
        foreach (var child in childStrings)
        {
            var childNode = Deserialize(this, child);
            if (childNode != null)
                node.Nodes.Add(childNode);
        }
    
        return node;
    }
    

    If haven't tested or even compiled this code, however, this is more or less how it could work.