php regex utf-8 preg-match

Find Chinese text in HTML using preg_match


I'm trying to extract the text from a string of HTML. I would like to capture only the text between tags and skip any empty tags.

My current attempt can be found here:
https://regex101.com/r/3Ujmw6/2

  • I can't use \w since I need to capture Chinese characters
  • I would like only text and not a lot of empty results

I have tried:

/>(\X+?)</g

// Fails on nested tags: it captures the first nested tag as well
<p><strong>blablab</strong></p>

And this:

/>(\X*?)</g

// Finds all the strings, but also returns lots of empty strings
// for adjacent tags (><)

Is there any way to exclude < from \X? Or is there a better way to write this so it returns only the text parts?


Solution

  • Try a regex like

    >(\s*[^\s<][^<]*)
    

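    This matches any text between > and < that isn't all whitespace: \s* skips leading whitespace, [^\s<] requires at least one character that is neither whitespace nor <, and [^<]* consumes the rest up to the next tag. See https://regex101.com/r/3Ujmw6/4.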
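
    For reference, here is a minimal sketch of how that pattern might be used from PHP with preg_match_all. The sample $html string and the variable names are made up for illustration, and the u modifier is added on the assumption that the input is UTF-8, which the Chinese text needs.

```php
<?php
// Hypothetical sample input; any UTF-8 HTML string would do.
$html = '<div><p><strong>你好世界</strong></p><span></span><p> Hello </p></div>';

// The pattern from the answer. The u modifier makes PCRE treat both
// the pattern and the subject as UTF-8.
$pattern = '/>(\s*[^\s<][^<]*)/u';

// Collect every match; $matches[1] holds the contents of capture group 1.
preg_match_all($pattern, $html, $matches);

// The capture can include leading/trailing whitespace, so trim each entry.
$texts = array_map('trim', $matches[1]);

print_r($texts);
```

    With this input, the adjacent tags (><) and the empty <span></span> produce no entries at all. Note that the pattern only sees text that follows a >, so text before the first tag is not captured, and a > inside an attribute value or a script block could still confuse it.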