Tags: regex, linux, bash, sed, quoting

How to get only lines with a single quote using GNU sed in Bash shell?


I'm writing a script to parse a text file (multiple lines). I need to print only lines matching the following pattern:

  1. First character of the line is an Uppercase letter
  2. Second character of the line is a lowercase letter OR a single quote
  3. Third character of the line is a lowercase letter OR a space

Examples of "valid" lines

  • Abcd
  • A'cd
  • Ab c

Attempts with GNU sed 4.2.2 on Linux

I ] First attempt (escaping)

$ html2text foo.html | sed -r "/^([A-Z][a-z\'])/!d"

Produces the following error message:

html2text foo.html | sed -r "/^([A-Z][a-z\'])/date"

sed: -e expression n°1, character 19: extra characters after command

II ] Second attempt (no escaping)

$ html2text foo.html | sed -r "/^([A-Z][a-z'])/!d"

Produces the following error message:

html2text foo.html | sed -r "/^([A-Z][a-z'])/date"

sed: -e expression n°1, character 18: extra characters after command

I'm not quite sure how to deal with a single quote (') within a range. I know that escaping a single quote within a single-quoted sed expression is not supported at all, but here both sed expressions are double-quoted.

The weird thing is that both error messages show ".../date" (first line of each error message), which appears to be a bug or parsing issue (the "/!d" flag is misinterpreted)...

Note: html2text converts 'foo.html' to plain text. The sed -r option enables extended regular expressions. "[A-Z]" matches a range of characters (the square brackets are not literals here).

Thanks for your help


Solution

  • As pointed out by casimir-et-hippolyte, using grep is simpler here:

    grep "^[A-Z][a-z'][a-z ]"

    or using sed:

    sed -n "/^[A-Z][a-z'][a-z ]/p"
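
A note on why the ".../date" shows up in the question: in an interactive Bash, "!" inside double quotes still triggers history expansion, so "!d" is replaced by the most recent command starting with "d" (here apparently `date`), and Bash echoes the expanded line before running it. The single quote inside the bracket expression is not the problem. Both commands above avoid "!" entirely. Below is a minimal sketch to check them. The sample lines and variable names are made up for illustration, and `LC_ALL=C` is an assumption so that the ranges are plain ASCII. It also shows the original "!d" form written with single quotes, which history expansion leaves alone.

```shell
# Assumption: force the C locale so [A-Z] and [a-z] are plain ASCII ranges.
export LC_ALL=C

# Hypothetical sample input: the three valid lines from the question
# plus three lines that should be rejected.
sample=$(printf '%s\n' "Abcd" "A'cd" "Ab c" "abcd" "A1cd" "ABcd")

# grep prints matching lines by default.
grep_out=$(printf '%s\n' "$sample" | grep "^[A-Z][a-z'][a-z ]")

# sed: -n suppresses automatic printing, p prints only the matching lines.
sed_out=$(printf '%s\n' "$sample" | sed -n "/^[A-Z][a-z'][a-z ]/p")

# The "!d" form, single-quoted so an interactive Bash does not touch "!".
# '\'' closes the quoted string, adds a literal quote, and reopens it.
sed_bang_out=$(printf '%s\n' "$sample" | sed '/^[A-Z][a-z'\''][a-z ]/!d')

printf '%s\n' "$grep_out"
```

All three variables should end up holding the same three lines (Abcd, A'cd, Ab c). In a script the double-quoted "!d" version from the question would probably work as written, since history expansion is normally only active in interactive shells; at the prompt, `set +H` turns it off.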