Regex Match NC-Comments in a string with mixed C#-Code


I have a text file with mixed NC code and C# code. C# code starts with "<#" and ends with "#>". I need a single regular expression that finds all NC comments. The problem is that NC comments start with ";", so I have trouble distinguishing an NC comment from the ";" that terminates a C# statement.

Is it possible to achieve this with only one regular expression?

; 1. NC-Comment
FUNCT_A;
FUNCT_B;

<# // C#-Code
int temp = 42;
string var = "hello";   // C#-Comment
#>

FUNCT_C ; 2. Comment

<# // C#-Code
for(int i = 0; i <10; i++)
{
    Console.WriteLine(i.ToString());
}
#>  

; 3. Comment
FUNCT_D;

The result of the regex should be {1. NC-Comment, 2. Comment, 3. Comment}

I have played around with the following regular expressions:

1.) (;(.*?)\r?\n) --> Finds all NC comments, but also treats C# code after a ";" as a comment
2.) (#>.*?<#)|(#>.*) --> Finds all NC code except the first NC code fragment
3.) #>.+?(?=<#) --> Finds all NC code except the first and last NC code fragments

One solution could be to push each "<#" onto a stack and pop it again at each "#>". Whenever the stack is empty, the current text is NC code. I would then still have to find out whether that text contains an NC comment.
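On the single-regex question: one pattern can be enough if it is allowed to match the C# blocks as well, so that every ";" inside a block is consumed before the comment alternative gets a chance to see it. The following is a minimal sketch of that idea, not a definitive implementation. It assumes blocks are not nested, every "<#" has a matching "#>", and "#>" never occurs inside a C# string or comment. The names NcCommentRegex, Find and the group name c are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NcCommentRegex // hypothetical name
{
    // Alternative 1 consumes a whole <# ... #> block without capturing anything.
    // Alternative 2 matches an NC comment and captures its text in group "c".
    private static readonly Regex Pattern = new Regex(
        @"<#.*?#>|;[ \t]*(?<c>[^\r\n]*)",
        RegexOptions.Singleline); // let '.' cross line breaks inside a block

    public static List<string> Find(string text)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            Group c = m.Groups["c"];
            // Keep only matches of alternative 2 that carry non-empty text,
            // so a bare ";" at the end of an NC line is ignored.
            if (c.Success && c.Value.Trim().Length > 0)
                result.Add(c.Value.Trim());
        }
        return result;
    }

    public static void Main()
    {
        string text = "; 1. NC-Comment\nFUNCT_A;\n<# int temp = 42; #>\nFUNCT_C ; 2. Comment\n";
        Console.WriteLine(string.Join(" | ", Find(text))); // prints "1. NC-Comment | 2. Comment"
    }
}
```

The filtering on group c happens in code rather than in the pattern, which keeps the expression itself free of lookarounds.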


Solution

  • I would rather do it without regex:

    // Requires: using System.Collections.Generic; using System.IO;
    public static List<string> GetNCComments(Stream stream)
    {
        using (StreamReader sr = new StreamReader(stream))
        {
            List<string> result = new List<string>();
            bool inCS = false; // are we in C# code?
            int c;
            while ((c = sr.Read()) != -1)
            {
                if (inCS)
                {
                    switch ((char)c)
                    {
                        case '#':
                            if (sr.Peek() == '>') // end of C# block
                            {
                                sr.Read();
                                inCS = false;
                            }
                            break;
                        case '/':
                            if (sr.Peek() == '/') // a C# comment
                                sr.ReadLine(); // skip the whole comment
                            break;
                    }
                }
                else
                {
                    switch ((char)c)
                    {
                        case '<':
                            if (sr.Peek() == '#') // start of C# block
                            {
                                sr.Read();
                                inCS = true;
                            }
                            break;
                        case ';': // NC comment
                            // rest of the line, without surrounding whitespace
                            string comment = sr.ReadLine()?.Trim();
                            if (!string.IsNullOrEmpty(comment))
                                result.Add(comment);
                            break;
                    }
                }
            }
            return result;
        }
    }
    

    Usage:

    var comments = GetNCComments(new FileStream(filePath, FileMode.Open, FileAccess.Read));
    

    The code is simple and self-explanatory. It handles C# line comments but not C# strings: a #> inside a C# comment is correctly ignored, but a #> inside a C# string is incorrectly treated as the end of the C# block. Handling that case is also easy.
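One way the string case could be handled is to skip over a string literal as soon as its opening quote is seen inside a C# block. The following is a minimal sketch under stated assumptions, not the definitive fix: it only covers regular "..." literals with backslash escapes, so verbatim @"..." strings, char literals such as '"' and /* */ block comments are still not handled. It takes a TextReader instead of a Stream purely to make the demo easy to run, and NcCommentScanner and SkipString are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class NcCommentScanner // hypothetical name
{
    public static List<string> GetNCComments(TextReader reader)
    {
        var result = new List<string>();
        bool inCS = false; // are we inside a <# ... #> block?
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (!inCS)
            {
                if (c == '<' && reader.Peek() == '#') // start of C# block
                {
                    reader.Read();
                    inCS = true;
                }
                else if (c == ';') // NC comment: take the rest of the line
                {
                    string comment = reader.ReadLine()?.Trim();
                    if (!string.IsNullOrEmpty(comment))
                        result.Add(comment);
                }
            }
            else if (c == '#' && reader.Peek() == '>') // end of C# block
            {
                reader.Read();
                inCS = false;
            }
            else if (c == '/' && reader.Peek() == '/') // C# line comment
            {
                reader.ReadLine();
            }
            else if (c == '"') // C# string literal: ignore any #> inside it
            {
                SkipString(reader);
            }
        }
        return result;
    }

    // Consumes a regular "..." literal up to its closing quote,
    // stepping over backslash escapes such as \".
    private static void SkipString(TextReader reader)
    {
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '\\')
                reader.Read(); // skip the escaped character
            else if (c == '"' || c == '\n')
                return; // closing quote, or give up at end of line
        }
    }

    public static void Main()
    {
        string text = "; first\n<# string s = \"#>\"; // note\n#>\nFUNCT ; second\n";
        foreach (string comment in GetNCComments(new StringReader(text)))
            Console.WriteLine(comment); // prints "first", then "second"
    }
}
```

Stopping at a line break inside SkipString is a deliberate safety net: a regular C# string literal cannot span lines, so an unmatched quote costs at most the rest of that line.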