Search code examples
ctokenstrtok

Why should strtok() be deprecated?


I hear this from a lot of programmers that the use of strtok maybe deprecated in near future. Some say it is still. Why is it a bad choice? strtok() works great in tokenizing a given string. Does it have to do anything with the time and space complexities? Best link I found on the internet was this. But that doesn't seem to solve my curiousity. Suggest any alternatives if possible.


Solution

  • Why is it a bad choice?

    The fundamental technique for solving problems by programming is to construct abstractions which can be used reliably to solve sub-problems, and then compose solutions to those sub-problems into solutions to larger problems.

    strtok's behaviour works directly against these goals in a variety of ways; it is a poor abstraction that is unreliable because it composes poorly.

    The fundamental problem of tokenization is: given a position in a string, give the position of the end of the token beginning at that position. If strtok did only that, it would be great. It would have a clear abstraction, it would not rely on hidden global state, it would not modify its inputs.

    To see the limitations of strtok, imagine trying to tokenize a language where we wish to separate tokens by spaces, unless the token is enclosed in " ", in which case we wish to apply a different tokenization rule to the contents of the quoted area, and then pick up with the space separation rule after. strtok composes very poorly with itself, and is therefore only useful for the most trivial of tokenization tasks.

    Does it have to do anything with the time and space complexities?

    No.

    Suggest any alternatives if possible.

    Lexers are not hard to write; just write one!

    Bonus points if you write an immutable lexer. An immutable lexer is a little struct that contains a reference to the string being lexed, the current position of the lexer, and any state needed by the lexer. To extract a token you call a "next token" method, pass in the lexer, and you get back the token and a new lexer. The new lexer can then be used to lex the next token, and you discard the previous lexer if you wish.

    The immutable lexer technique is easier to reason about than lexers which modify state. And you can debug them by saving the discarded lexers in a list, and now you have the complete history of tokenization operations open to inspection at once.