c#, parsing, programming-languages, lexical-analysis

C# (My own Programming Language) - How to find PRINT STRING more than once when parsing


I am currently making my own programming language, based on howCode's programming language in Python. I took an hour or so to convert it to C#, and it went well. However, when I tell the parser to parse the tokens we have collected, it handles only the first PRINT STRING it finds in our tokens and then stops.

This is the code for my parser, my lexer, my script for the language, and the console:

Parser:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace BL
{
    public static class Parser
    {
        public static void Parse(string toks)
        {
            if (toks.Substring(0).Split(':')[0] == "PRINT STRING")
            {
                Console.WriteLine(toks.Substring(toks.IndexOf('\"') + 1).Split('\"')[0]);
            }
        }
    }
}

Lexer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace BL
{
    public static class Lexer
    {
        public static string tok = "";
        public static string str;
        public static int state = 0;
        public static string tokens = "";

        public static void Lex(string data)
        {
            foreach (char c in data)
            {
                tok += c;

                if (tok == " ")
                {
                    if (state == 0)
                    {
                        tok = "";
                        tokens += " ";
                    }
                    else if (state == 1)
                    {
                        tok = " ";
                    }
                }
                else if (tok == Environment.NewLine)
                {
                    tok = "";
                }
                else if (tok == "PRINT")
                {
                    tokens += "PRINT";
                    tok = "";
                }
                else if (tok == "\"")
                {
                    if (state == 0)
                    {
                        state = 1;
                    }
                    else if (state == 1)
                    {
                        tokens += "STRING:" + str + "\" ";
                        str = "";
                        state = 0;
                        tok = "";
                    }
                }
                else if (state == 1)
                {
                    str += tok;
                    tok = "";
                }
            }

            Parser.Parse(tokens);
        }
    }
}

My script:

PRINT "HELLO WORLD1" PRINT "HELLO WORLD2"

The console:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace BL
{
    class Program
    {
        static string data;

        static void Main(string[] args)
        {
            Console.Title = "Compiler";
            string input = Console.ReadLine();
            Open(input);

            Lexer.Lex(data);

            Console.ReadLine();
        }

        public static void Open(string file)
        {
            data = File.ReadAllText(file);
        }
    }
}

When I print the contents of tokens (in Lexer), I get this:

PRINT STRING:"HELLO WORLD1" PRINT STRING:"HELLO WORLD2"

However, when I parse it, it only prints HELLO WORLD1, not HELLO WORLD1 with HELLO WORLD2 underneath it. I'm not sure what I should do to get the other PRINT STRING, and since this is a project only I have created, there is no answer online. Thank you in advance.


Solution

  • You're attempting to parse the language, which is good, but your Lex() function generates a second programming language as its result: one long string of text. Parse() then checks only the start of that string, with a single if and no loop, which is why only the first PRINT runs. To handle the rest, it would need its own parsing logic to pick the resulting text apart again.

    This is why, most of the time this sort of problem is solved, the Lex() function creates a list of tokens for someone else to consume. Generally these tokens are more than just strings, but many little languages like this one can get away with a simple list of strings as tokens.

    Since I have a soft spot for toy languages, I've modified your example to follow this flow. It loads the file from user input, then breaks it into individual tokens and uses those tokens to 'run' the program:

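    // These methods go inside a class such as Program and need:
    // using System; using System.Collections.Generic; using System.IO;

    // Parse a list of tokens from Lex()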
    static void Parse(List<string> tokens)
    {
        // Run through each token in the list of tokens
        for (int i = 0; i < tokens.Count; i++)
        {
            // And act on the token
            switch (tokens[i])
            {
                case "PRINT":
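                    // PRINT prints the next token
                    // Move to the next token first
                    i++;
                    // Guard against a PRINT with nothing after it
                    if (i >= tokens.Count)
                    {
                        Console.WriteLine("ERROR: PRINT expects an argument");
                        break;
                    }
                    // And dump it out
                    Console.WriteLine(tokens[i]);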
                    break;
    
                default:
                    // Anything else is an error, so emit an error
                    Console.WriteLine("ERROR: Unknown token " + tokens[i]);
                    break;
            }
        }
    }
    
    // Lex the contents of a source code file, returning a list of tokens
    static List<string> Lex(string data)
    {
        // The current token we're building up
        string current = "";
        // Are we inside of a quoted string?
        bool inQuote = false;
        // The list of tokens to return
        List<string> tokens = new List<string>();
    
        foreach (char c in data)
        {
            if (inQuote)
            {
                switch (c)
                {
                    case '"':
                        // The string literal has ended, go ahead and note 
                        // we're no longer in quote
                        inQuote = false;
                        break;
                    default:
                        // Anything else gets added to the current token
                        current += c;
                        break;
                }
            }
            else
            {
                switch (c)
                {
                    case '"':
                        // This is the start of a string literal, note that
                        // we're in it and move on
                        inQuote = true;
                        break;
                    case ' ':
                    case '\n':
                    case '\r':
                    case '\t':
                        // Tokens are separated by whitespace, so any whitespace
                        // causes the current token to be added to the list of tokens
                        if (current.Length > 0)
                        {
                            // Only add non-empty tokens
                            tokens.Add(current);
                            current = "";
                        }
                        break;
                    default:
                        // Anything else is part of a token, just add it
                        current += c;
                        break;
                }
            }
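        }
    
        // Flush the final token, since the input may not end with whitespace
        if (current.Length > 0)
        {
            tokens.Add(current);
        }
    
        return tokens;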
    }
    
    // Quick demo
    static void Main(string[] args)
    {
        string input = Console.ReadLine();
        string data = File.ReadAllText(input);
    
        List<string> tokens = Lex(data);
        Parse(tokens);
    
        Console.ReadLine();
    }
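
The answer mentions that tokens are generally more than just strings. As a rough sketch of what that could look like here, each token below carries a kind alongside its text, so the parser can tell the keyword PRINT apart from a string literal and reject input such as PRINT PRINT. The names TokenSketch, TokenKind, Token and TypedDemo are invented for this illustration and are not part of the original code; treat it as one possible shape under those assumptions, not the definitive design.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

namespace TokenSketch // hypothetical namespace, for illustration only
{
    // Hypothetical token kinds; a larger language would need more
    public enum TokenKind { Keyword, String }

    // A token records what kind of thing it is as well as its text
    public sealed class Token
    {
        public TokenKind Kind { get; }
        public string Text { get; }

        public Token(TokenKind kind, string text)
        {
            Kind = kind;
            Text = text;
        }
    }

    public static class TypedDemo
    {
        // Split source text into typed tokens
        public static List<Token> Lex(string data)
        {
            var tokens = new List<Token>();
            var current = new StringBuilder();
            bool inQuote = false;

            foreach (char c in data)
            {
                if (inQuote)
                {
                    if (c == '"')
                    {
                        // Closing quote: emit the literal, even an empty one
                        tokens.Add(new Token(TokenKind.String, current.ToString()));
                        current.Clear();
                        inQuote = false;
                    }
                    else
                    {
                        current.Append(c);
                    }
                }
                else if (c == '"' || char.IsWhiteSpace(c))
                {
                    // A quote or whitespace ends any bare word in progress
                    if (current.Length > 0)
                    {
                        tokens.Add(new Token(TokenKind.Keyword, current.ToString()));
                        current.Clear();
                    }
                    inQuote = (c == '"');
                }
                else
                {
                    current.Append(c);
                }
            }

            // Flush a trailing bare word; an unterminated string is
            // silently dropped in this sketch
            if (!inQuote && current.Length > 0)
            {
                tokens.Add(new Token(TokenKind.Keyword, current.ToString()));
            }
            return tokens;
        }

        // Walk the tokens and run each PRINT statement
        public static void Parse(List<Token> tokens)
        {
            for (int i = 0; i < tokens.Count; i++)
            {
                Token t = tokens[i];
                if (t.Kind == TokenKind.Keyword && t.Text == "PRINT")
                {
                    if (i + 1 < tokens.Count && tokens[i + 1].Kind == TokenKind.String)
                    {
                        i++; // consume the string argument
                        Console.WriteLine(tokens[i].Text);
                    }
                    else
                    {
                        Console.WriteLine("ERROR: PRINT expects a string");
                    }
                }
                else
                {
                    Console.WriteLine("ERROR: Unexpected token " + t.Text);
                }
            }
        }

        public static void Main()
        {
            // Prints HELLO WORLD1, then HELLO WORLD2 on the next line
            Parse(Lex("PRINT \"HELLO WORLD1\" PRINT \"HELLO WORLD2\""));
        }
    }
}
```

Because the string token is emitted as soon as the closing quote is seen, this version does not depend on trailing whitespace and keeps empty literals such as "".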