c# compiler-construction lexer

How to tokenize source code based on operators in C#


I'm reading all the lines from a TextBox and splitting them into tokens, and I'm trying to remove all the whitespace entries that end up in the resulting list.

I need to be able to tokenize the following expression:

if(x==0)
{
    cout<<x;
} 

into

if
(
x
==
0
)
{
cout
<<
x
;
} 

My code:

public static string[] Tokenize(string sourceCode)
{
    Regex RE = new Regex(@"([\s+\+\-\*\%\,\;\&\|\<\>\=\!\{\}])");
    string[] x = RE.Split(sourceCode);

    var list = new List<string>(x);
    list.Remove(" ");

    for (int m = 0; m < list.Count(); m++)
    {
        Console.WriteLine(list[m]);
    }

    return (RE.Split(sourceCode));
}

My output:

if(x
=

=
0)






{








 

 

 
cout
<

<
x
;







}

How can I split on multi-character symbols such as ==, << and &&, and how can I remove the whitespace entries from the list? Is there a better way of achieving what I want?


Solution

  • I agree with @juharr's comment. But if you really want to use a regex, it is better to use the Matches method instead of Split, because it lets you specify the tokens you are looking for rather than the token boundaries:

     Regex RE = new Regex(@"\w+|\(|\)|\++|-+|\*|%|,|;|&+|\|+|<+|>+|=+|!|\{|\}");
     foreach (Match m in RE.Matches(sourceCode))
     {
         Console.WriteLine(m.Value);
     }
    

    Result:

    if
    (
    x
    ==
    0
    )
    {
    cout
    <<
    x
    ;
    }
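
To connect this back to the Tokenize method in the question, here is a minimal sketch of how the same Matches approach could return the tokens as an array instead of printing them. The class name Lexer and the exact operator list are assumptions for illustration (the `/` operator and the explicit two-character operators are not in the original pattern), so adjust them to the language you are tokenizing.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name "Lexer" is not from the original post.
public static class Lexer
{
    // Two-character operators are listed before the single characters, so
    // "==" is matched as one token rather than as two "=" tokens.
    // The operator set here is an assumption; extend it as needed.
    private static readonly Regex TokenPattern = new Regex(
        @"\w+|==|!=|<=|>=|<<|>>|&&|\|\||\+\+|--|[-+*/%,;&|<>=!(){}]");

    public static string[] Tokenize(string sourceCode)
    {
        // Matches returns only the text that fits the pattern, so whitespace
        // never appears in the result and nothing has to be removed afterwards.
        return TokenPattern.Matches(sourceCode)
                           .Cast<Match>()
                           .Select(m => m.Value)
                           .ToArray();
    }
}
```

Calling Lexer.Tokenize on the snippet from the question should yield the twelve tokens listed under Result above. Note one difference from the pattern in the answer: this sketch spells out each operator, so a run such as === is split into == and =, whereas =+ would keep it as a single token. In both versions, any character the pattern does not mention (for example a quote or a dot) is silently skipped rather than reported as an error.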