Finding if a string is a substring of another in Sml without library functions

I am trying to write a function that subString : string * string -> int that checks if the first string is a substring of the second and its case sensitive.

I want to return the index starting from 0 if the first string is a substring or -1 if it is not. if it appears multiple times just return the index of the first appearance.

for instance:

subString("bc","abcabc") ===>1
subString("aaa","aaaa") ===>0
subString("bc","ABC") ===>-1

I am having a lot of trouble wrapping my brain around this because I am not too familiar with sml or using strings in sml and I am not supposed to use any built in functions like String.sub.

I can use helper functions though.

all I can think of is to use explode somehow in a helper function and somehow check the lists and then implode them, but how do I get the indexed position?

all I have is

fun subString(s1,s2) =
     if null s2 then ~1
     else if s1 = s2 then 0
     else 1+subString(s1, tl s2);

I am thinking of using a helper function that explodes the strings and then maybe compares the two but I can't figure how to get that to work.

Solution

I am not supposed to use any built in functions like String.sub

What a pity! Since strings have an abstract interface while you with lists have direct access to its primary constructors, [] and ::, you have to use library functions to get anywhere with strings. explode is also a library function. But okay, if your constraint is that you have to convert your string into a list to solve the exercise, so be it.

Given your current code,

fun subString(s1,s2) =
     if null s2 then ~1
     else if s1 = s2 then 0
     else 1+subString(s1, tl s2);

I sense one problem here:

   subString ([#"b",#"c"], [#"a",#"b",#"c",#"d"])
~> if null ([#"a",#"b",#"c",#"d"]) then ... else
   if [#"b",#"c"] = [#"a",#"b",#"c",#"d"] then ... else
   1 + subString([#"b",#"c"], [#"b",#"c",#"d"])

~> 1 + subString([#"b",#"c"], [#"b",#"c",#"d"])
~> 1 + if null ([#"b",#"c",#"d"]) then ... else
       if [#"b",#"c"] = [#"b",#"c",#"d"] then ... else
       1 + subString([#"b",#"c"], [#"c",#"d"])

It seems that the check s1 = s2 is not exactly enough: We should have liked to say that [#"b",#"c"] is a substring of [#"b",#"c",#"d"] because it's a prefix of it, not because it is equivalent. With s1 = s2 you end up checking that something is a valid suffix, not a valid substring. So you need to change s1 = s2 into something smarter.

Perhaps you can build a helper function that determines if one list is a prefix of another and use that here?

As for solving this exercise by explodeing your strings into lists: This is highly inefficient, so much that Standard ML's sister language Ocaml had explode entirely removed from the library:

The functions explode and implode were in older versions of Caml, but we omitted them from OCaml because they encourage inefficient code. It is generally a bad idea to treat a string as a list of characters, and seeing it as an array of characters is a much better fit to the actual implementation.

So first off, String.isSubstring already exists, so this is a solved problem. But if it weren't, and one wanted to write this compositionally, and String.sub isn't cheating (it is accessing a character in a string, comparable to pattern matching the head and tail of a list via x::xs), then let me encourage you to write efficient, composable and functional code:

(* Check that a predicate holds for all (c, i) of s, where
 * s is a string, c is every character in that string, and
 * i is the position of c in s. *)
fun alli s p =
    let val stop = String.size s
        fun go i = i = stop orelse p (String.sub (s, i), i) andalso go (i + 1)
    in go 0 end

(* needle is a prefix of haystack from the start'th index *)
fun isPrefixFrom (needle, haystack, start) =
    String.size needle + start <= String.size haystack andalso
    alli needle (fn (c, i) => String.sub (haystack, i + start) = c)

(* needle is a prefix of haystack if it is from the 0th index *)
fun isPrefix (needle, haystack) =
    isPrefixFrom (needle, haystack, 0)

(* needle is a substring of haystack if is a prefix from any index *)
fun isSubstring (needle, haystack) =
    let fun go i =
            String.size needle + i <= String.size haystack andalso
            (isPrefixFrom (needle, haystack, i) orelse go (i + 1))
    in go 0 end

The general idea here, which you can re-use when building an isSubstring that uses list recursion rather than string index recursion, is to build the algorithm abstractly: needle being a substring of haystack can be defined in simpler terms by needle being the prefix of haystack counting from any valid position in haystack (of course not such that it exceeds haystack). And determining if something is a prefix is much easier, even easier with list recursion!

This suggestion would leave you with a template,

fun isPrefix ([], _) = ...
  | isPrefix (_, []) = ...
  | isPrefix (x::xs, y::ys) = ...

fun isSubstring ([], _) = ...
  | isSubstring (xs, ys) = ... isPrefix ... orelse ...

As for optimizing the string index recursive solution, you could avoid the double bounds checking in both isPrefixFrom and in isSubstring by making isPrefixFrom a local function only accessible to isPrefix and isSubstring; otherwise it will be unsafe.

Testing this,

- isSubstring ("bc", "bc");
> val it = true : bool
- isSubstring ("bc", "bcd");
> val it = true : bool
- isSubstring ("bc", "abc");
> val it = true : bool
- isSubstring ("bc", "abcd");
> val it = true : bool
- isSubstring ("bc", "");
> val it = false : bool