Parsing tags in string


I'm trying to parse a string with custom tags like this:

[color value=0x000000]This house is [wave][color value=0xFF0000]haunted[/color][/wave]. 
I've heard about ghosts [shake]screaming[/shake] here after midnight.[/color]

I've figured out which regexps to use:

/\[color value=(.*?)\](.*?)\[\/color\]/gs
/\[wave\](.*?)\[\/wave\]/gs
/\[shake\](.*?)\[\/shake\]/gs

The problem is that I need the correct ranges (startIndex, endIndex) of those groups in the resulting string, so that I can apply the effects correctly. That's where I feel completely lost, because every time I replace a tag, the indices of everything after it can shift. It gets especially hard with nested tags.
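
To make the nesting problem concrete, here is a minimal sketch that runs the color pattern on the sample string. It is written in Python only for brevity (`re.S` plays the role of the `s` flag), and the variable names are my own. The lazy `(.*?)` stops at the first `[/color]` it meets, so the outer opening tag ends up paired with the inner closing tag:

```python
import re

text = ("[color value=0x000000]This house is [wave][color value=0xFF0000]haunted"
        "[/color][/wave]. I've heard about ghosts [shake]screaming[/shake] "
        "here after midnight.[/color]")

# Same pattern as in the question; re.S lets '.' match newlines like the 's' flag.
color_re = re.compile(r"\[color value=(.*?)\](.*?)\[/color\]", re.S)

first = color_re.search(text)
outer_value = first.group(1)  # attribute of the OUTER opening tag
captured = first.group(2)     # text up to the FIRST closing tag, which is the inner one

print(outer_value)
print(captured)
```

The captured text still contains `[wave]` and the inner `[color ...]` tag, and it ends after "haunted" rather than at the end of the sentence, so neither the pairing nor the offsets can be trusted.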

So the input is a string:

[color value=0x000000]This house is [wave][color value=0xFF0000]haunted[/color][/wave]. 
I've heard about ghosts [shake]screaming[/shake] here after midnight.[/color]

And as output I want to get something like:

Apply color 0x000000 from 0 to 76
Apply wave from 14 to 20
Apply color 0xFF0000 from 14 to 20
Apply shake from 47 to 55

Note that the indices refer to the result string (the text with the tags removed).

How do I parse it?


Solution

  • Unfortunately, I'm not familiar with ActionScript, but this C# code shows one solution using regular expressions. Rather than matching specific tags, I used a regular expression that matches any tag. I also didn't try to write a single expression that matches a whole start tag, its text, and its end tag (which I think is impossible to do reliably with nested tags). Instead, the expression matches just a start OR end tag, and some extra processing then pairs up the start and end tags and removes them from the string while keeping the essential information.

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    
    class Program
    {
       static void Main(string[] args)
       {
          string data = "[color value=0x000000]This house is [wave][color value=0xFF0000]haunted[/color][/wave]. " +
                        "I've heard about ghosts [shake]screaming[/shake] here after midnight.[/color]";
    
          ParsedData result = ParseData(data);
          foreach (TagInfo t in result.tags)
          {
             if (string.IsNullOrEmpty(t.attributeName))
             {
                Console.WriteLine("Apply {0} from {1} to {2}", t.name, t.start, t.start + t.length - 1);
             }
             else
             {
                Console.WriteLine("Apply {0} {1}={2} from {3} to {4}", t.name, t.attributeName, t.attributeValue, t.start, t.start + t.length - 1);
             }
             Console.WriteLine(result.data);
             Console.WriteLine("{0}{1}\n", new string(' ', t.start), new string('-', t.length));
          }
       }
    
       static ParsedData ParseData(string data)
       {
          List<TagInfo> tagList = new List<TagInfo>();
          Regex reTag = new Regex(@"\[(\w+)(\s+(\w+)\s*=\s*([^\]]+))?\]|\[(\/\w+)\]");
          Match m = reTag.Match(data);
    
          // Phase 1 - Collect all the start and end tags, noting their position in the original data string
          while (m.Success)
          {
             if (m.Groups[1].Success) // Matched a start tag
             {
                tagList.Add(new TagInfo()
                {
                   name = m.Groups[1].Value,
                   attributeName = m.Groups[3].Value,
                   attributeValue = m.Groups[4].Value,
                   tagLength = m.Groups[0].Length,
                   start = m.Groups[0].Index
                });
             }
             else if (m.Groups[5].Success) // Matched an end tag
             {
                tagList.Add(new TagInfo()
                {
                   name = m.Groups[5].Value,
                   tagLength = m.Groups[0].Length,
                   start = m.Groups[0].Index
                });
             }
             m = m.NextMatch();
          }
    
          // Phase 2 - match end tags to start tags
          List<TagInfo> unmatched = new List<TagInfo>();
          foreach (TagInfo t in tagList)
          {
             if (t.name.StartsWith("/"))
             {
                for (int i = unmatched.Count - 1; i >= 0; i--)
                {
                   if (unmatched[i].name == t.name.Substring(1))
                   {
                      t.otherEnd = unmatched[i];
                      unmatched[i].otherEnd = t;
                      unmatched.RemoveAt(i);
                      break;
                   }
                }
             }
             else
             {
                unmatched.Add(t);
             }
          }
    
          int subtractLength = 0;
          // Phase 3 - Remove tags from the string, updating start positions and calculating length in the process
          foreach (TagInfo t in tagList.ToArray())
          {
             t.start -= subtractLength;
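             // If this is an end tag, calculate the length for the corresponding start tag
             // (when it has one) and remove the end tag from the tag list.
             // Checking the name avoids a null reference on unmatched tags and also
             // handles empty elements such as [wave][/wave].
             if (t.name.StartsWith("/"))
             {
                if (t.otherEnd != null)
                {
                   t.otherEnd.length = t.start - t.otherEnd.start;
                }
                tagList.Remove(t);
             }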
             // Keep track of how many characters in tags have been removed from the string so far
             subtractLength += t.tagLength;
          }
    
          return new ParsedData()
          {
             data = reTag.Replace(data, string.Empty),
             tags = tagList.ToArray()
          };
       }
    
       class TagInfo
       {
          public int start;
          public int length;
          public int tagLength;
          public string name;
          public string attributeName;
          public string attributeValue;
          public TagInfo otherEnd;
       }
    
       class ParsedData
       {
          public string data;
          public TagInfo[] tags;
       }
    }
    

    The output is:

    Apply color value=0x000000 from 0 to 76
    This house is haunted. I've heard about ghosts screaming here after midnight.
    -----------------------------------------------------------------------------
    
    Apply wave from 14 to 20
    This house is haunted. I've heard about ghosts screaming here after midnight.
                  -------
    
    Apply color value=0xFF0000 from 14 to 20
    This house is haunted. I've heard about ghosts screaming here after midnight.
                  -------
    
    Apply shake from 47 to 55
    This house is haunted. I've heard about ghosts screaming here after midnight.
                                                   ---------
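
The three phases above can also be folded into a single pass with a stack, which may be easier to port to ActionScript. The following is only a sketch in Python, not a translation of the C# code: `TAG_RE`, `Span`, `parse_tags` and `sample` are names made up for this illustration, and it assumes that unmatched tags should simply be dropped from the text without producing a range.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Matches an opening tag with an optional name=value attribute, or a closing tag.
TAG_RE = re.compile(r"\[(\w+)(?:\s+(\w+)\s*=\s*([^\]]+))?\]|\[/(\w+)\]")


class Span(NamedTuple):
    name: str
    value: Optional[str]  # attribute value, or None when the tag has no attribute
    start: int            # inclusive index in the stripped text
    end: int              # inclusive index (start - 1 for an empty element)


def parse_tags(text: str) -> Tuple[str, List[Span]]:
    pieces: List[str] = []  # text between tags, in order
    plain_len = 0           # length of the stripped text collected so far
    # Stack of open tags: (name, value, start in stripped text, opening order)
    open_tags: List[Tuple[str, Optional[str], int, int]] = []
    found: List[Tuple[int, Span]] = []
    pos = 0

    for order, m in enumerate(TAG_RE.finditer(text)):
        chunk = text[pos:m.start()]
        pieces.append(chunk)
        plain_len += len(chunk)
        pos = m.end()
        if m.group(1) is not None:
            # Opening tag: remember where its content starts in the stripped text.
            open_tags.append((m.group(1), m.group(3), plain_len, order))
            continue
        # Closing tag: pair it with the nearest open tag of the same name.
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i][0] == m.group(4):
                name, value, start, opened = open_tags.pop(i)
                found.append((opened, Span(name, value, start, plain_len - 1)))
                break

    pieces.append(text[pos:])
    found.sort(key=lambda item: item[0])  # report spans in opening-tag order
    return "".join(pieces), [span for _, span in found]


sample = ("[color value=0x000000]This house is [wave][color value=0xFF0000]haunted"
          "[/color][/wave]. I've heard about ghosts [shake]screaming[/shake] "
          "here after midnight.[/color]")
plain, spans = parse_tags(sample)
for s in spans:
    print(s.name, s.value, s.start, s.end)
```

Because the stripped length is tracked while scanning, no index ever has to be corrected afterwards. On the sample string this should give the same ranges as the C# output above, with inclusive end indices.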