Is there a way to check if a string is alphanumeric in Erlang?


I am collecting tweets from Twitter using Erlang and I am trying to save only the hashtags to a database. However, when I convert the bitstrings to list strings, all the tweets with non-Latin letters turn into strange symbols. Is there any way to check whether a string contains only alphanumeric characters in Erlang?


Solution

  • The easiest way is to use regular expressions.

    StringAlphanum = "1234abcZXYM".
    StringNotAlphanum = "1ZXYMÄ#kMp&?".
    
    re:run(StringAlphanum, "^[0-9A-Za-z]+$").
    >> {match,[{0,11}]}
    
    re:run(StringNotAlphanum, "^[0-9A-Za-z]+$").
    >> nomatch
    

    You can easily make a function out of it...

    isAlphaNum(String) -> 
        case re:run(String, "^[0-9A-Za-z]+$") of
            {match, _} -> true;
            nomatch    -> false
        end.
    

    But, in my opinion, the better way would be to solve the underlying problem: the correct interpretation of Unicode binary strings.

    If you want to represent Unicode characters correctly, do not use binary_to_list. Use the unicode module instead. A UTF-8 encoded binary cannot be interpreted naively byte by byte, because a single character may span several bytes. For example, the most significant bit of a byte indicates whether that byte belongs to a multi-byte sequence.

    I took the following example from this site. Let's define a UTF-8 string:

    Utf8String = <<195, 164, 105, 116, 105>>.
    

    Interpreted naively as a sequence of bytes, it yields:

    binary_to_list(Utf8String).
    "Ã¤iti"
    

    Interpreted with Unicode support:

    unicode:characters_to_list(Utf8String, utf8).
    "äiti"
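
Putting the two ideas together, a hashtag received as a UTF-8 binary could be decoded with the unicode module first and only then checked character by character. The following is a minimal sketch under that assumption, not a definitive implementation; the module name `hashtag_filter` and the function `is_ascii_alphanum/1` are hypothetical names chosen for illustration. It avoids `re` so that code points above 255 in the decoded list cannot cause trouble.

```erlang
-module(hashtag_filter).
-export([is_ascii_alphanum/1]).

%% Hypothetical helper: returns true only if Bin is a valid, non-empty
%% UTF-8 binary made up entirely of ASCII letters and digits.
is_ascii_alphanum(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars), Chars =/= [] ->
            lists:all(fun is_ascii_alphanum_char/1, Chars);
        _ ->
            %% Empty input, or an {error, _, _} / {incomplete, _, _}
            %% tuple returned for invalid or truncated UTF-8.
            false
    end.

%% Character-level check on decoded code points, not on raw bytes.
is_ascii_alphanum_char(C) when C >= $0, C =< $9 -> true;
is_ascii_alphanum_char(C) when C >= $a, C =< $z -> true;
is_ascii_alphanum_char(C) when C >= $A, C =< $Z -> true;
is_ascii_alphanum_char(_) -> false.
```

With this sketch, `hashtag_filter:is_ascii_alphanum(<<"erlang2024">>)` returns true, while the UTF-8 binary `<<195, 164, 105, 116, 105>>` from the example above is decoded to a list starting with code point 228 (ä) and is therefore rejected. Because the check runs on decoded code points, it could be widened to accept other letters if non-Latin hashtags should be kept.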