
How to scan an input string into a token stream


I'm writing a simple lexical analyzer in C. The first thing I want to do is tokenize the input statement. (Example statement: printf1234=---abc)

How can I separate "printf", "1234", "=", "---", and "abc" using strtok()?

Here's my experimental code for this:

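#include <stdio.h>
#include <string.h>

int main(void)
{
    char input_string[100];
    char string_storage[100][100];   // not used yet; meant to hold the tokens
    char *token;

    printf("Enter a string: ");
    // fgets() is bounded, unlike gets(), which was removed from the language in C11
    if (fgets(input_string, sizeof input_string, stdin) == NULL)
        return 1;
    input_string[strcspn(input_string, "\n")] = '\0';   // drop the trailing newline

    // Splits on spaces only, so "printf1234=---abc" comes back as a single token
    token = strtok(input_string, " ");
    while (token != NULL)
    {
        printf("%s\n", token);
        //strcpy(string_storage[i], token);
        token = strtok(NULL, " ");
    }
    getchar();   // portable stand-in for getch(): wait for Enter before exiting
    return 0;
}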

Solution

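  • As you have probably realized by now, strtok() is not suitable here. It splits on delimiter characters and discards them, but your input has no delimiters: the tokens run together, and "=" and "---" are tokens you want to keep. Even if you found a delimiter set that worked for this one input, it would not give you a general-purpose tokenizer.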

    What you can do instead is decide first what counts as a token. That gives you a set of lexical rules: for example, a number such as 1234 is one token, = is another, and so on. Once the rules are fixed, extracting the tokens is a well-understood problem with standard solutions that you can apply yourself.

    This is known as lexical analysis in compiler design. No meaning is attached to the tokens at this stage, and since you didn't mention any semantic processing, you can stop here. Take a look at lex to get some ideas. If you don't need a tool like that, or don't need that level of detail, you will have to write a small automaton that does the job yourself (in effect, hand-coded regular-expression matching).

    A thorough discussion of this can be found in the Dragon Book (Compilers: Principles, Techniques, and Tools, by Aho, Lam, Sethi, and Ullman). Read it if you want to dig deeper.
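
The hand-written automaton mentioned in the answer could look roughly like the sketch below. It is only an illustration, not a full lexer: the names next_token, print_tokens and token_kind are made up for this example, and the rules (a run of letters is a word, a run of digits is a number, a run of the same punctuation character is a symbol) are assumptions chosen so that the example input splits the way the question asks.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token classes for this sketch; a real lexer would have more. */
enum token_kind { TOK_WORD, TOK_NUMBER, TOK_SYMBOL, TOK_END };

/*
 * Copy the next token found at *cursor into buf and advance *cursor past
 * the characters consumed. size is the capacity of buf and must be at
 * least 1; tokens longer than size - 1 characters are truncated in buf.
 * Returns TOK_END (with an empty buf) when the input is exhausted.
 */
enum token_kind next_token(const char **cursor, char *buf, size_t size)
{
    const char *p = *cursor;
    const char *start;
    enum token_kind kind;
    size_t len;

    while (isspace((unsigned char)*p))       /* skip blanks between tokens */
        p++;

    start = p;
    if (*p == '\0') {
        kind = TOK_END;
    } else if (isalpha((unsigned char)*p)) {
        kind = TOK_WORD;                     /* assumed rule: run of letters */
        while (isalpha((unsigned char)*p))
            p++;
    } else if (isdigit((unsigned char)*p)) {
        kind = TOK_NUMBER;                   /* assumed rule: run of digits */
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        char first = *p;
        kind = TOK_SYMBOL;                   /* assumed rule: run of one symbol */
        while (*p == first)                  /* "---" stays together, "=" stands alone */
            p++;
    }

    len = (size_t)(p - start);
    if (len >= size)
        len = size - 1;                      /* truncate what does not fit */
    memcpy(buf, start, len);
    buf[len] = '\0';

    *cursor = p;
    return kind;
}

/* Print every token in text, one per line. */
void print_tokens(const char *text)
{
    char buf[100];

    while (next_token(&text, buf, sizeof buf) != TOK_END)
        printf("%s\n", buf);
}
```

Calling print_tokens("printf1234=---abc") from main() prints printf, 1234, =, --- and abc on separate lines. Unlike strtok(), this scanner never modifies the input and never throws characters away; each branch is one state of a very small automaton, so adding a rule (for example, letting digits continue an identifier) means changing one loop condition.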