regex bash sed screen-scraping

Sed: Scraping a Range of Numbers


Is it possible to specify a range of numbers (1-31) in the pattern I use to match a <strong> tag? In the output, the tag appears as: <strong>21. Infinite Safari Balls</strong>.

Edited

#!/bin/bash

wget -q -O - 'goo.gl/vfYA94' | \
  sed -En '/<strong>([1-9]|[12][0-9]|3[01])\./,/<\/blockquote>/p' | \
  sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

Solution

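  • As I understand it, you want to print each block of lines that starts with a line containing <strong>NN. (where NN is a number between 1 and 31) and ends with the next line that contains </blockquote>. sed has no notion of numeric ranges, but you can get the same effect with a regular expression that enumerates the allowed digit patterns: [1-9] covers 1-9, [12][0-9] covers 10-29, and 30|31 covers the rest. The trailing \. requires the number to end there, so a heading such as 32. or 100. is not accepted through one of the shorter alternatives: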

    wget -q -O - 'goo.gl/vfYA94' | sed -En '/<strong>([1-9]|[12][0-9]|30|31)\./,/<\/blockquote>/p'
    

    To reduce the number of backslashes in the regular expression, I used the -E option for extended regexes. The -E option is recognized by sed on both Mac OS X and GNU/Linux, although older releases of GNU sed documented only -r for this purpose.
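To check the pattern without downloading the page, here is a minimal sketch that runs the same sed range expression over a small inline sample. The sample HTML and the function names sample_html and extract_blocks are made up for illustration; the real page's markup may differ.

```shell
#!/bin/bash

# Hypothetical sample standing in for the downloaded page (not the real markup).
sample_html() {
  cat <<'EOF'
<blockquote><strong>7. First Item</strong>
text for item seven
</blockquote>
<blockquote><strong>32. Out Of Range</strong>
skipped text
</blockquote>
<blockquote><strong>21. Infinite Safari Balls</strong>
code line
</blockquote>
EOF
}

# Print every block that starts at a "<strong>N." line with N in 1-31
# and ends at the next line containing </blockquote>.
extract_blocks() {
  sed -En '/<strong>([1-9]|[12][0-9]|3[01])\./,/<\/blockquote>/p'
}

sample_html | extract_blocks
```

With this sample, the blocks headed 7. and 21. are printed and the block headed 32. is skipped, because none of the three alternatives can be followed by a literal dot at that point.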