Assuming a postcode is in the form A0A 0AA or A0 0AA, where A is any letter and 0 is any digit, I have written the following sed script to search a web page for postcodes.
s/\(([[:alnum:]]\{2,4\})\) \(([[:alnum:]]\{3\})\)/\1 \2/p
The intent is to store the first part (A0A) in the first group and the second part (0AA) in the second group, then print what is found. However, running this currently does not find any postcodes.
Any ideas? Thanks
It's hard to find something right with your regex.
- What are the inner, unescaped parentheses there for? In sed's basic regular expressions, \( and \) form a group, while unescaped ( and ) are matched literally. They serve no purpose in any case.
- Why are you trying to match two [:alnum:] blocks when your actual pattern requires [:alpha:] in some places and [:digit:] in others?
- Why {2,4}? You want two or three, not two, three or four. What you actually want is either letter-number-letter or letter-number.
- Because you don't specify word boundaries, even a fixed regex would let the first pattern match A0 at the end of a longer word and the second pattern match 0AA at the beginning of one (for example, inside XA0 0AAX).
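The first point is easy to check from a shell. This is only a small sketch: the sample input strings are made up for illustration, and the expression is the one from the question, unchanged. It prints the line only when the input actually contains literal parentheses.

```shell
# The expression from the question, unchanged. In basic regular expressions
# \( and \) group, while a bare ( or ) matches a literal parenthesis.
orig='s/\(([[:alnum:]]\{2,4\})\) \(([[:alnum:]]\{3\})\)/\1 \2/p'

# Hypothetical input without parentheses: nothing matches, nothing is printed.
plain=$(printf '%s\n' 'A0A 0AA' | sed -n "$orig")

# Hypothetical input with literal parentheses: the line matches and is printed.
wrapped=$(printf '%s\n' '(A0A) (0AA)' | sed -n "$orig")

printf 'plain:   [%s]\n' "$plain"
printf 'wrapped: [%s]\n' "$wrapped"
```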
At minimum, you need to:
- Drop the inner parentheses
- Change the {2,4} to {2,3}
- Add word boundary matches at the beginning and end of the regex
However, this will still not properly satisfy your requirements. Because [:alnum:] accepts any mix of letters and digits, it will also match invalid patterns such as 12 345. What you really need to do is:
- Drop the inner parentheses
- Change the first pattern to match either [:alpha:][:digit:] or [:alpha:][:digit:][:alpha:] (there are two ways to do this).
- Change the second pattern to match [:digit:][:alpha:][:alpha:]
- Add word boundary matches at the beginning and end of the regex.
I didn't give a concrete example of how to do this because you asked for "any ideas". I'm assuming you want to try and fix this yourself given the right pointers.
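If you would rather check your attempt against something, here is one possible sketch of the second list. It assumes GNU sed, since the \< and \> word-boundary operators are a GNU extension and other sed implementations may spell them differently. It uses an optional trailing letter for the first part, which is one of the two ways mentioned above. The sample lines are made up for illustration.

```shell
# Sketch assuming GNU sed (\< and \> are GNU word-boundary extensions).
# Group 1: letter, digit, optional letter  (A0 or A0A)
# Group 2: digit, letter, letter           (0AA)
fixed='s/\<\([[:alpha:]][[:digit:]][[:alpha:]]\{0,1\}\) \([[:digit:]][[:alpha:]]\{2\}\)\>/\1 \2/p'

# Hypothetical sample lines: the first two contain a postcode in the assumed
# format, the last three do not.
found=$(printf '%s\n' \
  'Office at A0A 0AA today' \
  'Depot at A0 0AA' \
  'XA0 0AAX' \
  'AB0 0AA' \
  'A0A 0A0' | sed -n "$fixed")

# Prints only the lines that contain a match.
printf '%s\n' "$found"
```

As in the question, the replacement puts back exactly what was matched, so the p flag prints the whole line that contains the postcode rather than the postcode alone.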