
Reading all characters in OCaml is too slow


I'm a beginner with OCaml and I want to read lines from a file and then examine all characters in each line. As a dummy example, let's say we want to count the occurrences of the character 'A' in a file.

I tried the following:

open Core.Std

let count_a acc string = 
    let rec count_help res stream =
        match Stream.peek stream with
        | None -> res
        | Some char -> Stream.junk stream; if char = 'A' then count_help (res+1) stream else count_help res stream
    in acc + count_help 0 (Stream.of_string string)

let count_a = In_channel.fold_lines stdin ~init:0 ~f:count_a

let () = print_string ((string_of_int count_a)^"\n")

I compile it with

 ocamlfind ocamlc -linkpkg -thread -package core -o solution solution.ml

run it with

$./solution < huge_file.txt

on a file with one million lines, which gives me the following times:

real    0m16.337s
user    0m16.302s
sys     0m0.027s

which is 4 times slower than my Python implementation. I'm fairly sure that it should be possible to make this go faster, but how should I go about doing this?


Solution

  • To count the number of 'A' characters in a string, you can just use the String.count function. Indeed, the simplest solution would be:

    open Core.Std
    
    let () =
      In_channel.input_all stdin |>
      String.count ~f:(fun c -> c = 'A') |>
      printf "we have %d A's\n"
    

    update

    A slightly more complicated (and less memory-hungry) solution with [fold_lines] looks like this:

    let () =
      In_channel.fold_lines stdin ~init:0 ~f:(fun n s ->
        n + String.count ~f:(fun c -> c = 'A') s) |>
        printf "we have %d A's\n"
    

    Admittedly, it is slower than the previous one: it takes 7.3 seconds on my 8-year-old laptop to count 'A' in a 20-megabyte text file, versus 3 seconds for the former solution.

    You may also find this post interesting.
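
For comparison, the same line-by-line count can be sketched with only the OCaml standard library, without Core or Stream. This is a minimal sketch and not part of the answer above: the names count_char and count_in_channel are hypothetical helpers introduced here for illustration, and no timing was measured for it.

```ocaml
(* Count how many times [target] occurs in [s], using only the
   standard library (hypothetical helper, not from the original post). *)
let count_char target s =
  let n = ref 0 in
  String.iter (fun c -> if c = target then incr n) s;
  !n

(* Sum the per-line counts over every line readable from channel [ic].
   [input_line] raises [End_of_file] once the channel is exhausted;
   the [exception] case keeps the recursive call in tail position. *)
let count_in_channel target ic =
  let rec loop acc =
    match input_line ic with
    | line -> loop (acc + count_char target line)
    | exception End_of_file -> acc
  in
  loop 0

(* Small in-memory check; for the task in the question one would
   call [count_in_channel 'A' stdin] instead. *)
let () =
  Printf.printf "we have %d A's\n" (count_char 'A' "BANANA")
```

Like the [fold_lines] version, this keeps only one line in memory at a time. Note also that the compile command in the question uses ocamlc, which produces bytecode; building with ocamlopt (native code) is likely to help as well, although that was not measured here.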