scala, text, data-preprocessing

Scala: split two words that aren't separated


I have a corpus containing words such as applefruit, where the component words are not separated by any delimiter, and I would like to separate them. Since this can be a non-linear problem, I would like to pass a custom dictionary and split only when a word from the dictionary is a substring of a word in the corpus.

For example, if my dictionary contains only apple and the corpus contains three words, applefruit, applebananafruit, and bananafruit, the output should be apple, fruit, apple, bananafruit, bananafruit.

Notice that I am not splitting bananafruit; the goal is to make the process faster by splitting only on the words provided in the dictionary. I am using Scala 2.x.


Solution

  • You can use regular expressions with split:

    scala> "foobarfoobazfoofoobatbat".split("(?<=foo)|(?=foo)")
    res27: Array[String] = Array(foo, bar, foo, baz, foo, foo, batbat)
    

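    Or, if your dictionary (and/or the list of strings to split) has more than one entry:

       // wordList: List[String] is the dictionary, toSplit: List[String] is the corpus.
       // Quote each word so any regex metacharacters in it are matched literally.
       val rx = wordList.map { w =>
         val q = java.util.regex.Pattern.quote(w)
         s"(?<=$q)|(?=$q)"
       }.mkString("|")
       val result: List[String] = toSplit.flatMap(_.split(rx))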
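
    Putting the pieces together for the example in the question, a minimal self-contained sketch might look like the following. The names DictionarySplit, boundaryRegex and splitAll are made up for illustration, and the code assumes Java 8 or later, where a zero-width match at the start of the string does not produce an empty leading element (empty pieces are filtered out anyway as a precaution).

```scala
import java.util.regex.Pattern

// Hypothetical helper object; the names are illustrative only.
object DictionarySplit {

  // Build one regex matching the zero-width positions immediately
  // before and immediately after any dictionary word.
  def boundaryRegex(dictionary: Seq[String]): String =
    dictionary
      .filter(_.nonEmpty)
      .map { w =>
        val q = Pattern.quote(w) // treat the word as a literal
        s"(?<=$q)|(?=$q)"
      }
      .mkString("|")

  // Split every corpus word at dictionary-word boundaries.
  // Words that contain no dictionary word are returned unchanged.
  def splitAll(corpus: Seq[String], dictionary: Seq[String]): Seq[String] = {
    val rx = boundaryRegex(dictionary)
    if (rx.isEmpty) corpus // an empty regex would split on every character
    else corpus.flatMap(s => s.split(rx).toSeq).filter(_.nonEmpty)
  }

  def main(args: Array[String]): Unit = {
    val dictionary = Seq("apple")
    val corpus = Seq("applefruit", "applebananafruit", "bananafruit")
    // prints: apple, fruit, apple, bananafruit, bananafruit
    println(splitAll(corpus, dictionary).mkString(", "))
  }
}
```

    One caveat with this approach: if one dictionary word is a prefix or substring of another (say app and apple), the boundaries of both apply, so apple itself would be split into app and le.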