Tags: javascript, regex, word-boundary

Split a string by max characters length, word aware - but without capturing whitespaces


The following regex (taken from here) splits a string by character length (e.g. 20 characters) while being word-aware (live demo):

\b[\w\s]{20,}?(?=\s)|.+$

This means that if a word would be cut in the middle (based on the given character length), the whole word is taken instead:

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b[\\w\\s]{${substringMaxLength},}?(?=\\s)|.+$`, 'g');

const substrings = str.match(regex);

console.log(substrings);

However, as can be seen when running the snippet above, every substring after the first starts with a leading whitespace character. Can that whitespace be excluded, so that we end up with this?

[
  "this is an input example",
  "of one sentence that",
  "contains a bit of words",
  "and must be split"
]

I tried adding [^\s], (?:\s), or (?!\s) in various places, but couldn't get it to work.

How can it be done?


Solution

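  • You can require that every match starts with \w, and apply that to both alternatives of your current regex. Because that leading \w already consumes one character, the quantifier's minimum becomes substringMaxLength - 1: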

    const str = "this is an input example of one sentence that contains a bit of words and must be split"
    
    const substringMaxLength = 20;
    
    const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?=\\s)|.*$)`, 'g');
    
    const substrings = str.match(regex);
    
    console.log(substrings);
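
For reuse, the same pattern can be wrapped in a small function. This is only a sketch: the name splitWordAware and its parameter names are hypothetical and not part of the original answer. It assumes the length is an integer of at least 1, and that the input contains only word characters and whitespace, since [\w\s] stops at punctuation and the `.*$` alternative would then take the rest of the string.

```javascript
// Hypothetical helper wrapping the regex from the answer above.
// Assumes minLength is an integer >= 1.
function splitWordAware(text, minLength) {
  const regex = new RegExp(
    // \b\w       : every match must start on a word character
    // {n-1,}?    : lazily take at least n-1 more word/space characters
    // (?=\s)     : stop right before whitespace, so words stay whole
    // .*$        : otherwise take whatever remains of the string
    `\\b\\w(?:[\\w\\s]{${minLength - 1},}?(?=\\s)|.*$)`,
    'g'
  );
  // String.prototype.match returns null when nothing matches.
  return text.match(regex) || [];
}

const parts = splitWordAware(
  "this is an input example of one sentence that contains a bit of words and must be split",
  20
);

console.log(parts);
// [
//   "this is an input example",
//   "of one sentence that",
//   "contains a bit of words",
//   "and must be split"
// ]
```

Note that the length acts as a minimum rather than a strict maximum: a chunk is extended to the end of the word it lands in, which is why the first chunk has 24 characters.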