
Iterating over split string in OCaml


Let's say I have a string:

"ab bc cdv gf
ed    aqb ahf sd
abcdef"

I want to a) split it on ' ', '\t' and/or '\r\n', and b) iterate over the resulting list of substrings and match each of them against some criterion (for example, keep only the words starting with 'a', i.e. ["ab"; "aqb"; "ahf"; "abcdef"]).

Note: we can't use Str or any other additional libraries.

I came up with something like this:

let f g =
  String.split_on_char ' ' g
  |> List.iter (fun x -> x);;

Obviously, though, it produces an error. And even if it worked, it wouldn't split on '\r\n'. Instead of List.iter I could use List.map (fun x -> x), but then I would just get back the list of substrings, split on the ' ' character only. So now another question: how can I use

"match (something?) with
| ..." 

in this case? I see no way of adding a match to the code above. Do we use the reverse |> and List.iter in this case, or is there another way I'm not aware of?


Solution

  • Simple approach: keep splitting on each whitespace character we care about, use List.concat_map to keep the list "flat", and then drop the empty strings.

    let s = "ab bc cdv gf ed aqb ahf sd abc\r\ndef" in
    let split = String.split_on_char in
    s
    |> split ' ' 
    |> List.concat_map (split '\n')
    |> List.concat_map (split '\r') 
    |> List.filter ((<>) "")
    
    (* Result:
     * ["ab"; "bc"; "cdv"; "gf"; "ed"; "aqb"; "ahf"; "sd"; "abc"; "def"] 
     *)
    

    You might also use your regular expression library of choice and split on \s+, but apparently that isn't allowed.

    You could also break this out into a function using a left fold, and supply the characters to split on as a string.

    let split_on delims str =
      delims 
      |> String.to_seq
      |> Seq.fold_left 
           (fun acc delim -> 
              List.concat_map (String.split_on_char delim) acc) 
           [str]
      |> List.filter ((<>) "")
    
    utop # split_on " \t\r\n" s;;
    - : string list =
    ["ab"; "bc"; "cdv"; "gf"; "ed"; "aqb"; "ahf"; "sd"; "abc"; "def"]
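The second half of the question (picking out words by some criterion, and where a match would go) can be handled with List.filter once the string has been split. The following is a minimal sketch, not the only way to do it: starts_with_a and words_with_a are hypothetical helper names introduced here for illustration, and split_on is repeated from the answer above so the block runs on its own. It also shows why the original List.iter (fun x -> x) is rejected: List.iter expects a function that returns unit, so it is suited to side effects such as printing, not to building a new list.

```ocaml
(* Split [str] on every character in [delims], dropping empty pieces.
   Same left fold as above, repeated so this block is self-contained. *)
let split_on delims str =
  delims
  |> String.to_seq
  |> Seq.fold_left
       (fun acc delim -> List.concat_map (String.split_on_char delim) acc)
       [str]
  |> List.filter (( <> ) "")

(* Hypothetical helper: true when the word's first character is 'a'.
   The empty string is matched first because "".[0] would raise. *)
let starts_with_a word =
  match word with
  | "" -> false
  | _ -> word.[0] = 'a'

(* Hypothetical helper: split on whitespace, keep words starting with 'a'. *)
let words_with_a text =
  split_on " \t\r\n" text |> List.filter starts_with_a

let () =
  let text = "ab bc cdv gf\r\ned    aqb ahf sd\r\nabcdef" in
  (* List.iter takes a function returning unit, so print each word. *)
  words_with_a text |> List.iter print_endline
```

Run on the string from the question, this prints ab, aqb, ahf and abcdef, one per line. List.filter is the right tool when you want a shorter list back; List.iter only makes sense for the final step where each element is consumed for its side effect.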