How can I extract all quotations in a text?


I'm looking for a SimpleGrepSedPerlOrPythonOneLiner that outputs all quotations in a text.


Example 1:

echo “HAL,” noted Frank, “said that everything was going extremely well.” | SimpleGrepSedPerlOrPythonOneLiner

stdout:

"HAL,"
"said that everything was going extremely well."

Example 2:

cat MicrosoftWindowsXPEula.txt | SimpleGrepSedPerlOrPythonOneLiner

stdout:

"EULA"
"Software"
"Workstation Computer"
"Device"
"DRM"

etc.

(link to the corresponding text).


Solution

  • I like this:

    perl -ne 'print "$_\n" foreach /"((?>[^"\\]|\\+[^"]|\\(?:\\\\)*")*)"/g;'
    

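    It's a little verbose, but it handles escaped quotes and avoids needless backtracking much better than the simplest implementation. It prints only the text between the quotes, because the capture group excludes the quote characters themselves. Spelled out, the pattern says: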

    my $re = qr{
       "               # Begin with a literal double quote
       (               # Capture everything between the quotes
         (?>           # Atomic group: once one alternative has matched, the
                       # engine keeps that choice and never backtracks into
                       # the group. It either matches or the branch fails.
    
             [^"\\]    # a character that is neither a dquote nor a backslash
         |   \\+       # OR one or more backslashes followed by
             [^"]      # something that is not a quote
         |   \\        # OR a single backslash
             (?>\\\\)* # followed by any number of *pairs* of backslashes (as units)
             "         # and a quote (that is, an escaped quote)
         )*            # repeated any number of times
      )                # end of the capture
      "                # End with a literal double quote
    }x;
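
Since the question also allows Python, here is a minimal sketch of the same idea there. It is not a port of the Perl pattern above: it swaps in the simpler, widely used "backslash plus any one character" formulation, and it assumes that a backslash always escapes exactly the character after it. The names `QUOTED` and `extract_quotes` are made up for illustration.

```python
import re

# Assumption: a backslash escapes whichever single character follows it.
QUOTED = re.compile(r'''
    "               # opening double quote
    (               # capture the body, without the surrounding quotes
      (?:
          [^"\\]    # a character that is neither a quote nor a backslash
        | \\.       # or a backslash together with the character it escapes
      )*
    )
    "               # closing double quote
''', re.VERBOSE | re.DOTALL)


def extract_quotes(text):
    """Return the body of each double-quoted span, escape sequences left as-is."""
    return QUOTED.findall(text)


if __name__ == "__main__":
    sample = r'He wrote "a \"nested\" word" and then "plain".'
    for body in extract_quotes(sample):
        print(body)
```

Because the two alternatives can never start with the same character, there is only one way to consume the body, so this version should not need an atomic group to avoid heavy backtracking.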
    

    If you don't need that much power (say the text is likely to be plain dialog with no escaped quotes inside the quoted parts), then

    /"([^"]*)"/ 
    

    probably works about as well as anything else.
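
For comparison, the simple pattern can be sketched in Python as follows. Two things here are assumptions rather than part of the answer above: the surrounding quote characters are kept in the output, as in the question's expected stdout, and a second alternative is added for the curly quotes that appear in Example 1. The names `SIMPLE` and `quotations` are hypothetical.

```python
import re

# Either a straight-quoted span, or a curly opening quote paired with a
# curly closing quote. There is no escape handling in this simple version.
SIMPLE = re.compile(r'"[^"]*"|“[^”]*”')


def quotations(text):
    """Return every quoted span, including its surrounding quote characters."""
    return SIMPLE.findall(text)


if __name__ == "__main__":
    line = '“HAL,” noted Frank, “said that everything was going extremely well.”'
    for quote in quotations(line):
        print(quote)
```

To use it as a filter, the same function could be fed `sys.stdin.read()` in place of the sample line.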