In C#, what is the best way to parse this WIKI markup?

I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the markup syntax below into some table data structure in C#.

Here is an example table:

 || Owner || Action || Status || Comments ||
 | Bill | Fix the lobby | In Progress | This is easy |
 | Joe | Fix the bathroom | In Progress | Plumbing \\
  Electric \\
 Painting \\
 \\ | 
 | Scott | Fix the roof | Complete | This is expensive |

and here is how it comes in directly:

|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive| 

So as you can see:

  • The column headers have "||" as the separator
  • The columns of a row have "|" as the separator
  • A row might span multiple lines (as in the second data row example above), so I would have to keep reading until I hit the same number of "|" (cols) that I have in the header row.

I tried reading in line by line and then concatenating lines that had "\" in between them, but that seemed a bit hacky.

I also tried simply reading it in as one full string, parsing by "||" first, and then reading until I hit the same number of "|" before moving on to the next row. This seemed to work, but it feels like there might be a more elegant way using regular expressions or something similar.

Can anyone suggest the correct way to parse this data?


  • I have largely replaced the previous answer, due to the fact that the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.

    Because there are no longer any line breaks after a row, the only reliable way to determine where a row ends is to require that each row has the same number of columns as the table header. The alternative would be to rely on a potentially fragile whitespace convention seen only in the single example string (i.e. that the row separator is the only | not preceded by a space), and your question does not specify that as the row delimiter.
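
    To make the column-count rule concrete, here is a minimal sketch of the arithmetic. `RowCounter` and `CountRows` are hypothetical names used only for illustration. It assumes every row has the form `| a | b | c |`, so each row contributes one more "|" than there are header columns.

    ```csharp
    using System;
    using System.IO;

    public static class RowCounter
    {
        // Hypothetical helper: counts the rows in the text that follows the header.
        // A row "| a | b | c |" contains headerCols + 1 separators, so splitting the
        // remainder on "|" yields 1 + rows * (headerCols + 1) pieces.
        public static int CountRows(string rowsPart, int headerCols)
        {
            if (headerCols < 1)
                throw new ArgumentOutOfRangeException(nameof(headerCols));

            int pieces = rowsPart.Split('|').Length;
            if (pieces % (headerCols + 1) != 1)
                throw new InvalidDataException("Row cells do not line up with the header columns.");

            return (pieces - 1) / (headerCols + 1);
        }
    }
    ```

    Note that this check only catches a total cell count that is not a whole number of rows. It cannot tell which row is malformed, and a stray "|" inside a cell's text would shift every cell that follows it.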

    The parser below performs the validity checks that can be derived from your format specification and example string, and it also handles tables that have no rows. The comments explain the basic steps.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public class TableParser
    {
        const StringSplitOptions SplitOpts = StringSplitOptions.None;
        const string RowColSep = "|";
        static readonly string[] HeaderColSplit = { "||" };
        static readonly string[] RowColSplit = { RowColSep };
        static readonly string[] MLColSplit = { @"\\" };

        public class TableRow
        {
            // One entry per column; each entry holds the "\\"-separated parts of that cell.
            public List<string[]> Cells;
        }

        public class Table
        {
            public string[] Header;
            public TableRow[] Rows;
        }

        public static Table Parse(string text)
        {
            // Isolate the header columns and rows remainder.
            var headerSplit = text.Split(HeaderColSplit, SplitOpts);
            Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");
            // Need to check whether there are any rows.
            var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
            var header = headerSplit.Skip(1)
                .Take(headerSplit.Length - (hasRows ? 2 : 1))
                .Select(c => c.Trim())
                .ToArray();
            if (!hasRows) // If no rows for this table, we are done.
                return new Table() { Header = header, Rows = new TableRow[0] };
            // Get all row columns from the remainder.
            var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);
            // Require same amount of columns for a row as the header.
            Ensure((rowsCols.Length % (header.Length + 1)) == 1,
                "The number of row columns does not match the number of header columns");
            var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];
            // Fill rows by sequentially taking # header column cells
            for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
            {
                rows[ri] = new TableRow() {
                    Cells = rowsCols.Skip(start).Take(header.Length)
                        .Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
                        .ToList()
                };
            }
            return new Table { Header = header, Rows = rows };
        }

        private static void Ensure(bool check, string errorMsg)
        {
            if (!check)
                throw new InvalidDataException(errorMsg);
        }
    }

    When used like this:

    public static void Main(params string[] args)
    {
        var wikiLine = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
        var table = TableParser.Parse(wikiLine);
        Console.WriteLine(string.Join(", ", table.Header));
        foreach (var r in table.Rows)
            Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
    }

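    It will produce the below output:

    Owner, Action, Status, Comments
    Bill
        # , fix the lobby, In Progress, This is eary
    Joe
        # , fix the bathroom
        # , In progress, plumbing
        # Electric
        # Painting
        # 
        # 
    Scott
        # , fix the roof
        # , Complete, this is expensive
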
    Where "\t# " represents a newline caused by the presence of \\ in the input.
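
    Cells such as `Bill\\` end with a `\\`, so splitting them leaves an empty trailing part. If you would rather store each cell as a single string, one option is to drop the empty parts and join the rest. This is only a sketch: `CellText.Flatten` and its `separator` parameter are hypothetical names, not part of the parser above.

    ```csharp
    using System;
    using System.Linq;

    public static class CellText
    {
        // Hypothetical helper: joins the non-empty parts of one parsed cell
        // (one string[] from TableRow.Cells) into a single string.
        public static string Flatten(string[] parts, string separator = "; ")
        {
            return string.Join(separator, parts.Where(p => !string.IsNullOrWhiteSpace(p)));
        }
    }
    ```

    For example, the parts `{ "plumbing", "Electric", "Painting", "", "" }` would become `plumbing; Electric; Painting`. Whether discarding empty parts is acceptable depends on whether blank lines inside a cell carry meaning in your wiki pages.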