I want to read all the words of a file into a list. The file is 6.3 MB and contains around one million words. This is what I implemented; it takes around 3.5 seconds to build the list. Is there a faster approach?
readfile(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    lists:map(fun(X) -> string:to_lower(binary_to_list(X)) end,
              re:split(binary_to_list(Binary), "[^a-zA-Z]")).
Something using string:tokens/2
will be faster:
readfile(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    Words = string:tokens(binary_to_list(Bin), " \t\r\n"),
    lists:map(fun(Word) -> string:to_lower(Word) end, Words).
The second argument is the list of separator characters to split on. If you want to split on other control characters, refer to the Erlang Data Types reference page for the complete list of character escape sequences.
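For instance, to also break words on punctuation, you can simply extend the separator list. A minimal sketch (the function name and the particular separator set here are my own choices, not part of the answer above):

```erlang
%% Split on whitespace and common punctuation characters.
readfile_punct(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    Words = string:tokens(binary_to_list(Bin), " \t\r\n.,;:!?"),
    [string:to_lower(W) || W <- Words].
```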
In my simple tests this function was almost five times faster, but performance will vary with the data, so test both functions on your own dataset to verify.
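An easy way to compare them is timer:tc/3, which runs a function and returns {Microseconds, Result}. A sketch, assuming both versions live in a module called mymod and the file is named words.txt (both names are placeholders):

```erlang
%% Run in the erl shell; timer:tc/3 returns {Microseconds, Result}.
{T1, _} = timer:tc(mymod, readfile_re, ["words.txt"]),
{T2, _} = timer:tc(mymod, readfile, ["words.txt"]),
io:format("re:split: ~p us, string:tokens: ~p us~n", [T1, T2]).
```

Run each version a few times and compare the medians, since the first run also pays for loading the file into the OS cache.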