Search code examples
c#algorithmdata-structurestreetext-parsing

How Can I Parse String and Get Random Sentences in C#?


I'm trying to figure out how to parse a string in this format into a tree like data structure of arbitrary depth. and after that make random sentences.

"{{Hello,Hi,Hey} {world,earth},{Goodbye,farewell} {planet,rock,globe{.,!}}}"

where

, means  or
{ means expand
} means collapse up to parent

for example, i want to get output like this:

1) hello world planet.
2) hi earth globe!
3) goodby planet.
and etc. 

Solution

  • The input string must be parsed. Since it can contain nested braces, we need a recursive parser. But to begin with, we need a data model to represent the tree structure.

    We can have three different types of items in this tree: text, a list representing a sequence and a list representing a choice. Let's derive three classes from this abstract base class:

    abstract public class TreeItem
    {
        public abstract string GetRandomSentence();
    }
    

    The TextItem class simply returns its text as "random sentence":

    public class TextItem : TreeItem
    {
        public TextItem(string text)
        {
            Text = text;
        }
    
        public string Text { get; }
    
        public override string GetRandomSentence()
        {
            return Text;
        }
    }
    

    The sequence concatenates the text of its items:

    public class SequenceItem : TreeItem
    {
        public SequenceItem(List<TreeItem> items)
        {
            Items = items;
        }
    
        public List<TreeItem> Items { get; }
    
        public override string GetRandomSentence()
        {
            var sb = new StringBuilder();
            foreach (var item in Items) {
                sb.Append(item.GetRandomSentence());
            }
            return sb.ToString();
        }
    }
    

    The choice item is the only one using randomness to pick one random item from the list:

    public class ChoiceItem : TreeItem
    {
        private static readonly Random _random = new();
    
        public ChoiceItem(List<TreeItem> items)
        {
            Items = items;
        }
    
        public List<TreeItem> Items { get; }
    
        public override string GetRandomSentence()
        {
            int index = _random.Next(Items.Count);
            return Items[index].GetRandomSentence();
        }
    }
    

    Note that the sequence and choice items both call GetRandomSentence() recursively on their items to descend the tree recursively.


    This was the easy part. Now lets create a parser.

    public class Parser
    {
        enum Token { Text, LeftBrace, RightBrace, Comma, EndOfString }
    
        int _index;
        string _definition;
        Token _token;
        string _text; // If token is Token.Text;
    
        public TreeItem Parse(string definition)
        {
            _index = 0;
            _definition = definition;
            GetToken();
            return Choice();
        }
    
        private void GetToken()
        {
            if (_index >= _definition.Length) {
                _token = Token.EndOfString;
                return;
            }
            switch (_definition[_index]) {
                case '{':
                    _index++;
                    _token = Token.LeftBrace;
                    break;
                case '}':
                    _index++;
                    _token = Token.RightBrace;
                    break;
                case ',':
                    _index++;
                    _token = Token.Comma;
                    break;
                default:
                    int startIndex = _index;
                    do {
                        _index++;
                    } while (_index < _definition.Length & !"{},".Contains(_definition[_index]));
                    _text = _definition[startIndex.._index];
                    _token = Token.Text;
                    break;
            }
        }
    
        private TreeItem Choice()
        {
            var items = new List<TreeItem>();
            while (_token != Token.EndOfString && _token != Token.RightBrace) {
                items.Add(Sequence());
                if (_token == Token.Comma) {
                    GetToken();
                }
            }
            if (items.Count == 0) {
                return new TextItem("");
            }
            if (items.Count == 1) {
                return items[0];
            }
            return new ChoiceItem(items);
        }
    
        private TreeItem Sequence()
        {
            var items = new List<TreeItem>();
            while (true) {
                if (_token == Token.Text) {
                    items.Add(new TextItem(_text));
                    GetToken();
                } else if (_token == Token.LeftBrace) {
                    GetToken();
                    items.Add(Choice());
                    if (_token == Token.RightBrace) {
                        GetToken();
                    }
                } else {
                    break;
                }
            }
            if (items.Count == 0) {
                return new TextItem("");
            }
            if (items.Count == 1) {
                return items[0];
            }
            return new SequenceItem(items);
        }
    }
    

    It consists of a lexer, i.e., a low level mechanism to split the input text into tokens. We have have four kinds of tokens: text, "{", "}" and ",". We represent these tokens as

    enum Token { Text, LeftBrace, RightBrace, Comma, EndOfString }
    

    We also have added a EndOfString token to tell the parser that the end of the input string was reached. When the token is Text we store this text in the field _text. The lexer is implemented by the GetToken() method which has no return value and instead sets the _token field, to make the current token available in the two parsing methods Choice() and Sequence().

    One difficulty is that when we encounter an item, we do not know whether it is a single item or whether it is part of a sequence or a choice. We assume that the whole sentence definition is a choice consisting of sequences, which gives sequences precedence over choices (like "*" has precedence over "+" in math).

    Both Choice and Sequence gather items in a temporary list. If this list contains only one item, then this item will be returned instead of a choice list or a sequence list.


    You can test this parser like this:

    const string example = "{{Hello,Hi,Hey} {world,earth},{Goodbye,farewell} {planet,rock,globe{.,!}}}";
    var parser = new Parser();
    var tree = parser.Parse(example);
    for (int i = 0; i < 20; i++) {
        Console.WriteLine(tree.GetRandomSentence());
    }
    

    The output might look like this:

    Goodbye rock
    Hi earth
    Goodbye globe.
    Hey world
    Goodbye rock
    Hi earth
    Hey earth
    farewell planet
    Goodbye globe.
    Hey world
    Goodbye planet
    Hello world
    Hello world
    Goodbye planet
    Hey earth
    farewell globe!
    Goodbye globe.
    Goodbye globe.
    Goodbye planet
    farewell rock