I am collecting tweets from Twitter using Erlang, and I am trying to save only the hashtags to a database. However, when I convert the bitstrings to list strings, all tweets containing non-Latin letters turn into strange symbols. Is there any way to check whether a string contains only alphanumeric characters in Erlang?
The easiest way is to use regular expressions.
StringAlphanum = "1234abcZXYM".
StringNotAlphanum = "1ZXYMÄ#kMp&?".
re:run(StringAlphanum, "^[0-9A-Za-z]+$").
>> {match,[{0,11}]}
re:run(StringNotAlphanum, "^[0-9A-Za-z]+$").
>> nomatch
You can easily make a function out of it...
isAlphaNum(String) ->
    case re:run(String, "^[0-9A-Za-z]+$") of
        {match, _} -> true;
        nomatch -> false
    end.
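Since re:run accepts binaries as well as lists, the check can also be applied to the tweet data before any conversion. The following is only a sketch: the module name alnum_check and the function name is_ascii_alnum are made up for illustration, and the pattern is anchored with \A and \z instead of ^ and $ so that a trailing newline does not slip through.

```erlang
-module(alnum_check).
-export([is_ascii_alnum/1]).

%% Hypothetical helper: returns true when Subject (a binary or a list of
%% bytes) is non-empty and consists only of ASCII letters and digits.
is_ascii_alnum(Subject) ->
    %% {capture, none} makes re:run/3 return the bare atom match or nomatch.
    re:run(Subject, "\\A[0-9A-Za-z]+\\z", [{capture, none}]) =:= match.
```

Without the unicode option the subject is matched byte by byte, so any UTF-8 encoded non-ASCII character makes the check fail, which is the filtering behaviour asked for in the question.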
But, in my opinion, the better way would be to solve the underlying problem: the correct interpretation of Unicode binary strings.
If you want to represent Unicode characters correctly, do not use binary_to_list. Use the unicode module instead. A Unicode binary string cannot be interpreted naively byte by byte, because an encoding such as UTF-8 may use several bytes for one character. For example, the most significant bits of the first byte of a character determine whether it starts a multi-byte sequence.
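The role of the leading byte can be sketched with binary pattern matching. This is an illustration only, not how the unicode module is implemented; the module name utf8_lead and the function seq_length are hypothetical, and the function looks at the first byte alone without validating the continuation bytes.

```erlang
-module(utf8_lead).
-export([seq_length/1]).

%% Hypothetical helper: reports how many bytes the first UTF-8 character
%% of a binary claims to occupy, judging only by its leading byte.
seq_length(<<0:1, _:7, _/binary>>) -> 1;       %% 0xxxxxxx: plain ASCII
seq_length(<<2#110:3, _:5, _/binary>>) -> 2;   %% 110xxxxx: two-byte sequence
seq_length(<<2#1110:4, _:4, _/binary>>) -> 3;  %% 1110xxxx: three-byte sequence
seq_length(<<2#11110:5, _:3, _/binary>>) -> 4; %% 11110xxx: four-byte sequence
seq_length(_) -> invalid.                      %% continuation byte, bad lead, or empty
```

For instance, seq_length(<<195, 164>>) returns 2, because 195 is 2#11000011, while seq_length(<<105>>) returns 1.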
I took the following example from this site. Let's define a UTF-8 string:
Utf8String = <<195, 164, 105, 116, 105>>.
Interpreted naively as a list of bytes, it yields:
binary_to_list(Utf8String).
"Ã¤iti"
Interpreted with unicode-support:
unicode:characters_to_list(Utf8String, utf8).
"äiti"
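Note that unicode:characters_to_list does not always return a list: for malformed input it returns an {error, ...} tuple, and for input that ends in the middle of a character an {incomplete, ...} tuple. A minimal sketch of a wrapper that handles all three cases follows; the module name safe_decode, the function decode, and the error atoms are assumptions chosen for illustration.

```erlang
-module(safe_decode).
-export([decode/1]).

%% Hypothetical wrapper: decodes a UTF-8 binary into a list of code points
%% and turns the failure tuples of the unicode module into {error, Reason}.
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            {ok, Chars};
        {error, _Decoded, _Rest} ->
            %% A byte sequence that is not valid UTF-8 was found.
            {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} ->
            %% The binary stops in the middle of a multi-byte character.
            {error, truncated_utf8}
    end.
```

With the example above, decode(Utf8String) returns {ok, [228,105,116,105]}, which the shell may print as {ok,"äiti"}.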