Haskell string tokenizer function


I needed a String tokenizer in Haskell, but there is apparently nothing suitable already defined in the Prelude or the other standard modules. There is splitOn in Data.Text, but that's a pain to use because you have to convert the String to Text first.
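For reference, the String-to-Text round trip mentioned here looks roughly like the following. This is a minimal sketch, assuming the text package is installed; splitOnString is a hypothetical helper name, not something the library provides.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: split a String on a delimiter String by
-- converting to Text, calling Data.Text.splitOn, and converting back.
-- The delimiter must be non-empty; Data.Text.splitOn raises an error otherwise.
splitOnString :: String -> String -> [String]
splitOnString delim = map T.unpack . T.splitOn (T.pack delim) . T.pack
```

With this in scope, splitOnString "," "a,b,,c" evaluates to ["a","b","","c"].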

The tokenizer is not too hard to write, so I wrote one (it doesn't collapse multiple adjacent delimiters, so those produce empty tokens, but it worked well for what I needed). I feel something like this should already be in the modules somewhere.

This is my version:

-- Split a String on a single delimiter character.
tokenizer :: Char -> String -> [String]
tokenizer delim str = tokHelper delim str []

-- Accumulate tokens in reverse order and reverse once at the end.
tokHelper :: Char -> String -> [String] -> [String]
tokHelper d s acc
    | null pos  = reverse (pre:acc)
    | otherwise = tokHelper d (tail pos) (pre:acc)
    where (pre, pos) = span (/=d) s
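The same split can also be written without an accumulator, which keeps it lazy and avoids the final reverse. This is only a sketch of one alternative; splitOnChar is a hypothetical name. It keeps the behaviour described above: adjacent delimiters are not collapsed and yield empty tokens.

```haskell
-- Hypothetical name; splits on a single delimiter character.
-- Adjacent delimiters produce empty tokens, and a trailing
-- delimiter produces a trailing empty token.
splitOnChar :: Char -> String -> [String]
splitOnChar d s = case break (== d) s of
    (pre, [])       -> [pre]                      -- no delimiter left
    (pre, _ : rest) -> pre : splitOnChar d rest   -- drop the delimiter, continue
```

For example, splitOnChar ',' "a,,b" evaluates to ["a","","b"].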

I searched the internet for more solutions and found some discussions, like this blog post.

The last comment (by Mahee on June 10, 2011) is particularly interesting. Why not generalize the words function to handle this? I searched for such a function but found none.
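A generalized words is short to write by hand. The sketch below follows the shape of the usual definition of words, with the whitespace test replaced by a caller-supplied predicate; wordsWhen is a hypothetical name chosen for illustration.

```haskell
-- Hypothetical name; like words, but the caller decides which
-- characters count as separators. Runs of separators are collapsed
-- and leading or trailing separators produce no empty tokens.
wordsWhen :: (Char -> Bool) -> String -> [String]
wordsWhen p s = case dropWhile p s of
    "" -> []
    s' -> w : wordsWhen p rest
      where (w, rest) = break p s'
```

For example, wordsWhen (== ',') "a,,b," evaluates to ["a","b"].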

Is there a simpler way to do this, or is 'tokenizing' a string not a very recurring problem? :)


Solution

  • The split library is what you need. Install it with cabal install split, and you get access to a range of split/tokenizer-style functions.

    Some examples from the library:

     > import Data.List.Split
     > splitOn "x" "axbxc"
     ["a","b","c"]
     > splitOn "x" "axbxcx"
     ["a","b","c",""]
     > endBy ";" "foo;bar;baz;"
     ["foo","bar","baz"]
     > splitWhen (<0) [1,3,-4,5,7,-9,0,2]
     [[1,3],[5,7],[0,2]]
     > splitOneOf ";.," "foo,bar;baz.glurk"
     ["foo","bar","baz","glurk"]
     > chunksOf 3 ['a'..'z']  -- named splitEvery in older releases of split
     ["abc","def","ghi","jkl","mno","pqr","stu","vwx","yz"]
    

    The wordsBy function from the same library is a generic version of words like you wanted:

    wordsBy (=='x') "dogxxxcatxbirdxx" == ["dog","cat","bird"]