Tags: php, regex, quotes, cpu-word, text-extraction

Get words and quoted phrases from text as an array


I would like to use a regex in PHP to split a string into words and phrases. Phrases are delimited by quotes, either double or single. The regular expression also has to take into account single quotes within words (e.g. nation's).

Example string:

The nation's economy 'is really' poor, but "might be getting" better.

I would like PHP to split this type of string into an array, using a regex, as follows:

Array
(
    [0] => "The"
    [1] => "nation's"
    [2] => "economy"
    [3] => "is really"
    [4] => "poor"
    [5] => "but"
    [6] => "might be getting"
    [7] => "better"
)

What PHP code would accomplish this?


Solution

  • Use preg_match_all with the following regex:

    (?<![\w'"])(?:['"][^'"]+['"]|[\w']+)(?![\w'"])
    

    Example: https://3v4l.org/vBGY7

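    // Match either a quoted phrase or a bare word (apostrophes allowed),
    // provided it is not adjacent to another word character or quote.
    preg_match_all(
      '/(?<![\w\'"])(?:[\'"][^\'"]+[\'"]|[\w\']+)(?![\w\'"])/',
      "The nation's economy 'is really' poor, but \"might be getting\" better.",
      $matches
    );

    // $matches[0] holds the full matches, surrounding quotes included.
    print_r($matches[0]);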
    

    (Note that this doesn't treat hy-phe-nat-ed words as single words, since the question doesn't ask for that; each part is matched separately.)

    Output (quoted phrases keep their surrounding quotes):

    Array
    (
        [0] => The
        [1] => nation's
        [2] => economy
        [3] => 'is really'
        [4] => poor
        [5] => but
        [6] => "might be getting"
        [7] => better
    )
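
The output above still carries the quote characters, whereas the array in the question does not. One way to drop them, offered as a sketch rather than as part of the original answer, is to capture the phrase contents with a branch-reset group `(?|...)`, so that every alternative writes to capture group 1. The function name `extract_terms` is made up for illustration. This variant also assumes the opening and closing quote must be the same character, which the regex above does not require, and that lets a double-quoted phrase contain an apostrophe.

```php
<?php
// Hypothetical helper (not from the original answer): returns words and
// quoted phrases with the surrounding quotes removed.
function extract_terms(string $text): array
{
    // (?|...) is a branch-reset group: each alternative's capture is group 1.
    //   "([^"]+)"   double-quoted phrase (may contain apostrophes)
    //   '([^']+)'   single-quoted phrase
    //   ([\w']+)    bare word, apostrophes allowed (e.g. nation's)
    // The lookarounds are the same as in the regex above.
    $pattern = '/(?<![\w\'"])(?|"([^"]+)"|\'([^\']+)\'|([\w\']+))(?![\w\'"])/';

    preg_match_all($pattern, $text, $matches);

    // With the default PREG_PATTERN_ORDER, $matches[1] lists group 1 per match.
    return $matches[1];
}

print_r(extract_terms(
    "The nation's economy 'is really' poor, but \"might be getting\" better."
));
```

For the example string this should give the eight entries listed in the question, with `is really` and `might be getting` free of quote characters. Like the original regex, it still splits hyphenated words into their parts.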