erlang file-read

What would be a faster way to make a list of all the words of a file?


I want to read all the words of a file into a list. The file is 6.3 MB and contains only about 1 million words. The implementation below takes around 3.5 seconds to build the list. Is there a faster approach?

readfile(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    lists:map(fun(X) -> string:to_lower(binary_to_list(X)) end,
              re:split(binary_to_list(Binary), "[^a-zA-Z]")).

Solution

  • Something using string:tokens/2 will be faster:

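    readfile(Filename) ->
      %% Read the whole file, split it on whitespace, then lowercase each word.
      {ok, Bin} = file:read_file(Filename),
      Words = string:tokens(binary_to_list(Bin), " \t\r\n"),
      lists:map(fun(Word) -> string:to_lower(Word) end, Words).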
    

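    The second argument is the list of separator characters to split on. If you want to split on other control characters, refer to the Erlang data_types page (the Data Types section of the reference manual) for the complete list of escape sequences. Note that this separator list contains only whitespace, so, unlike the regular expression in the question, punctuation stays attached to the words; add those characters to the list if you want them treated as separators too.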

    In my simple tests this function was almost 5 times faster than the original. Benchmark both functions on your own dataset to confirm, since performance will vary with the data.
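
One way to compare the two functions on your own file is sketched below using timer:tc/1, which returns the elapsed time in microseconds together with the result. The module name words_bench and the function names readfile_re, readfile_tokens and compare are made up for this illustration; the two readfile variants are the code from the question and from this answer. Treat it as a rough single-run measurement, not a rigorous benchmark.

```erlang
-module(words_bench).
-export([readfile_re/1, readfile_tokens/1, compare/1]).

%% The question's approach: split on every non-letter with re:split/2.
%% re:split/2 returns binaries by default, hence the binary_to_list/1.
readfile_re(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    Parts = re:split(binary_to_list(Binary), "[^a-zA-Z]"),
    [string:to_lower(binary_to_list(P)) || P <- Parts].

%% The answer's approach: split on whitespace with string:tokens/2.
readfile_tokens(FileName) ->
    {ok, Bin} = file:read_file(FileName),
    Words = string:tokens(binary_to_list(Bin), " \t\r\n"),
    [string:to_lower(W) || W <- Words].

%% Run each function once and report {Label, Microseconds, WordCount}.
compare(FileName) ->
    {ReMicros, ReWords} = timer:tc(fun() -> readfile_re(FileName) end),
    {TokMicros, TokWords} = timer:tc(fun() -> readfile_tokens(FileName) end),
    [{re_split, ReMicros, length(ReWords)},
     {tokens, TokMicros, length(TokWords)}].
```

After compiling the module, calling words_bench:compare/1 with the path to your file returns both timings. The word counts will usually differ: the regular expression version strips punctuation and yields an empty string wherever two separators are adjacent, while the string:tokens/2 version never returns empty tokens but keeps punctuation attached to words.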