javascript regex split cpu-word

Split a string that contains HTML tags


I have this HTML string:

this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too

I want to split it and get a result array like this:

this simple 
the<b>html string</b>
text test 
that<b>need</b>to<b>spl</b>it
it too

I tried it this way:

    // Requires the XRegExp library (with its Unicode addons) to be loaded.
    var string = 'this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too';
    var regex = XRegExp('((?:[\\p{L}\\p{Mn}]+|)<\\s*.*?[^>]*>.*?<\/.*?>(?:[\\p{L}\\p{Mn}]+|))', "g");

    var result = string.split(regex);

It didn't work. I don't want to split word by word. Is there a way to do this?


Solution

  • Use

    string.split(/\s*(?<!\S)([^\s<>]+(?:\s+[^\s<>]+)*)(?!\S)\s*/).filter(Boolean);
    

    The capturing group makes split() keep each matched chunk in the resulting array; filter(Boolean) then drops the empty strings left over.
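
To make the role of the capturing group concrete, here is a minimal sketch on made-up inputs (the sample strings and variable names are illustrative, not taken from the question). It compares split() with and without a capturing group in the separator, and shows what filter(Boolean) removes.

```javascript
// Illustrative input; not taken from the question.
const sample = 'a, b, c';

// Without a capturing group the separators are discarded.
const withoutGroup = sample.split(/,\s*/);

// With a capturing group the separators are kept as array elements.
const withGroup = sample.split(/(,\s*)/);

// A match at the very start leaves an empty string in the result,
// which filter(Boolean) drops.
const leading = ',a'.split(/(,)/);
const cleaned = leading.filter(Boolean);

console.log(withoutGroup); // [ 'a', 'b', 'c' ]
console.log(withGroup);    // [ 'a', ', ', 'b', ', ', 'c' ]
console.log(leading);      // [ '', ',', 'a' ]
console.log(cleaned);      // [ ',', 'a' ]
```

In the answer's regex the roles are reversed compared with a typical split: the "separator" is the plain-text run you want to keep, and the text between matches is the tagged chunk.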

    REGEX EXPLANATION

    NODE                     EXPLANATION
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      (?<!                     look behind to see if there is not:
    --------------------------------------------------------------------------------
        \S                       non-whitespace (all but \n, \r, \t, \f,
                                 and " ")
    --------------------------------------------------------------------------------
      )                        end of look-behind
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        [^\s<>]+                 any character except: whitespace (\n,
                                 \r, \t, \f, and " "), '<', '>' (1 or
                                 more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
    --------------------------------------------------------------------------------
          \s+                      whitespace (\n, \r, \t, \f, and " ")
                                   (1 or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
          [^\s<>]+                 any character except: whitespace (\n,
                                   \r, \t, \f, and " "), '<', '>' (1 or
                                   more times (matching the most amount
                                   possible))
    --------------------------------------------------------------------------------
        )*                       end of grouping
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
    --------------------------------------------------------------------------------
        \S                       non-whitespace (all but \n, \r, \t, \f,
                                 and " ")
    --------------------------------------------------------------------------------
      )                        end of look-ahead
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
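
The two lookarounds act as whitespace boundaries: a run of text only matches if it is not glued to a tag on either side. The sketch below uses a simplified, single-word variant of the pattern (the name boundary and the sample string are assumptions made for illustration). Lookbehind assertions need an engine with ES2018 support.

```javascript
// Simplified, illustrative variant of the answer's pattern:
// one word with whitespace (or a string edge) on both sides.
const boundary = /(?<!\S)[^\s<>]+(?!\S)/g;

// "the" is rejected because it is followed by "<";
// "x" is rejected because it is preceded by ">".
const words = 'foo the<b>x</b> bar'.match(boundary);

console.log(words); // [ 'foo', 'bar' ]
```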
    

    JavaScript:

    const string = 'this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too';
    const regex = /\s*(?<!\S)([^\s<>]+(?:\s+[^\s<>]+)*)(?!\S)\s*/;
    console.log(string.split(regex).filter(Boolean));

    Output:

    [
      "this simple",
      "the<b>html string</b>",
      "text test",
      "that<b>need</b>to<b>spl</b>it",
      "it too"
    ]
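
If you need this in more than one place, the same expression can be wrapped in a small helper. This is only a sketch: the function name splitKeepingTags and the extra test strings are hypothetical. It is tuned to simple tags like the ones in the question; a tag with several space-separated attributes appears to get split inside the tag, so a real HTML parser is the safer choice for arbitrary markup.

```javascript
// Hypothetical helper name; wraps the regex from the answer unchanged.
function splitKeepingTags(text) {
  const regex = /\s*(?<!\S)([^\s<>]+(?:\s+[^\s<>]+)*)(?!\S)\s*/;
  // split() keeps the captured plain-text runs; filter(Boolean) drops empty strings.
  return text.split(regex).filter(Boolean);
}

console.log(splitKeepingTags('plain words only'));     // [ 'plain words only' ]
console.log(splitKeepingTags('<b>x</b>'));             // [ '<b>x</b>' ]
console.log(splitKeepingTags('one <i>two</i> three')); // [ 'one', '<i>two</i>', 'three' ]
```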