I'm trying to figure out how to parse a string in this format into a tree like data structure of arbitrary depth. and after that make random sentences.
"{{Hello,Hi,Hey} {world,earth},{Goodbye,farewell} {planet,rock,globe{.,!}}}"
where
, means or
{ means expand
} means collapse up to parent
for example, i want to get output like this:
1) hello world planet.
2) hi earth globe!
3) goodby planet.
and etc.
The input string must be parsed. Since it can contain nested braces, we need a recursive parser. But to begin with, we need a data model to represent the tree structure.
We can have three different types of items in this tree: text, a list representing a sequence and a list representing a choice. Let's derive three classes from this abstract base class:
abstract public class TreeItem
{
public abstract string GetRandomSentence();
}
The TextItem
class simply returns its text as "random sentence":
public class TextItem : TreeItem
{
public TextItem(string text)
{
Text = text;
}
public string Text { get; }
public override string GetRandomSentence()
{
return Text;
}
}
The sequence concatenates the text of its items:
public class SequenceItem : TreeItem
{
public SequenceItem(List<TreeItem> items)
{
Items = items;
}
public List<TreeItem> Items { get; }
public override string GetRandomSentence()
{
var sb = new StringBuilder();
foreach (var item in Items) {
sb.Append(item.GetRandomSentence());
}
return sb.ToString();
}
}
The choice item is the only one using randomness to pick one random item from the list:
public class ChoiceItem : TreeItem
{
private static readonly Random _random = new();
public ChoiceItem(List<TreeItem> items)
{
Items = items;
}
public List<TreeItem> Items { get; }
public override string GetRandomSentence()
{
int index = _random.Next(Items.Count);
return Items[index].GetRandomSentence();
}
}
Note that the sequence and choice items both call GetRandomSentence()
recursively on their items to descend the tree recursively.
This was the easy part. Now lets create a parser.
public class Parser
{
enum Token { Text, LeftBrace, RightBrace, Comma, EndOfString }
int _index;
string _definition;
Token _token;
string _text; // If token is Token.Text;
public TreeItem Parse(string definition)
{
_index = 0;
_definition = definition;
GetToken();
return Choice();
}
private void GetToken()
{
if (_index >= _definition.Length) {
_token = Token.EndOfString;
return;
}
switch (_definition[_index]) {
case '{':
_index++;
_token = Token.LeftBrace;
break;
case '}':
_index++;
_token = Token.RightBrace;
break;
case ',':
_index++;
_token = Token.Comma;
break;
default:
int startIndex = _index;
do {
_index++;
} while (_index < _definition.Length & !"{},".Contains(_definition[_index]));
_text = _definition[startIndex.._index];
_token = Token.Text;
break;
}
}
private TreeItem Choice()
{
var items = new List<TreeItem>();
while (_token != Token.EndOfString && _token != Token.RightBrace) {
items.Add(Sequence());
if (_token == Token.Comma) {
GetToken();
}
}
if (items.Count == 0) {
return new TextItem("");
}
if (items.Count == 1) {
return items[0];
}
return new ChoiceItem(items);
}
private TreeItem Sequence()
{
var items = new List<TreeItem>();
while (true) {
if (_token == Token.Text) {
items.Add(new TextItem(_text));
GetToken();
} else if (_token == Token.LeftBrace) {
GetToken();
items.Add(Choice());
if (_token == Token.RightBrace) {
GetToken();
}
} else {
break;
}
}
if (items.Count == 0) {
return new TextItem("");
}
if (items.Count == 1) {
return items[0];
}
return new SequenceItem(items);
}
}
It consists of a lexer, i.e., a low level mechanism to split the input text into tokens. We have have four kinds of tokens: text, "{", "}" and ",". We represent these tokens as
enum Token { Text, LeftBrace, RightBrace, Comma, EndOfString }
We also have added a EndOfString
token to tell the parser that the end of the input string was reached. When the token is Text
we store this text in the field _text
. The lexer is implemented by the GetToken()
method which has no return value and instead sets the _token
field, to make the current token available in the two parsing methods Choice()
and Sequence()
.
One difficulty is that when we encounter an item, we do not know whether it is a single item or whether it is part of a sequence or a choice. We assume that the whole sentence definition is a choice consisting of sequences, which gives sequences precedence over choices (like "*" has precedence over "+" in math).
Both Choice
and Sequence
gather items in a temporary list. If this list contains only one item, then this item will be returned instead of a choice list or a sequence list.
You can test this parser like this:
const string example = "{{Hello,Hi,Hey} {world,earth},{Goodbye,farewell} {planet,rock,globe{.,!}}}";
var parser = new Parser();
var tree = parser.Parse(example);
for (int i = 0; i < 20; i++) {
Console.WriteLine(tree.GetRandomSentence());
}
The output might look like this:
Goodbye rock
Hi earth
Goodbye globe.
Hey world
Goodbye rock
Hi earth
Hey earth
farewell planet
Goodbye globe.
Hey world
Goodbye planet
Hello world
Hello world
Goodbye planet
Hey earth
farewell globe!
Goodbye globe.
Goodbye globe.
Goodbye planet
farewell rock