Given text strings such as these:
wikiradio 27/09/2012 - LE QUATTRO GIORNATE DI NAPOLI raccontate da Ida Gribaudi
wikiradio 10/04/2013 - DAG HAMMARSKJOLD raccontato da Susanna Pesenti
I am working on a regular expression that matches only the UPPERCASE WORDS in these strings (i.e. "LE QUATTRO GIORNATE DI NAPOLI" and "DAG HAMMARSKJOLD"). My code is this:
$title = $_GET["title"];
if (preg_match_all('/\\b(?=[A-Z])[A-Z\' ]+(?=\\W)/',$title,$match)) {
    // process matched portion...
}
It works almost always, but it fails when the $title string contains an apostrophe followed by a space, or a dash. For example, the uppercase words in these two titles are not matched:
wikiradio 11/02/2014 - L'ABBE' PIERRE raccontato da Giovanni Anversa
wikiradio 22/12/2015 - JEAN-MICHEL BASQUIAT raccontato da Costantino D'Orazio
What am I missing?
Something like this may work for you:
\b[A-Z].*?(?= [a-z])
Legend
\b # regex word boundary [1]
[A-Z] # any single uppercase letter
.*? # any char repeated zero or more times, in lazy mode
(?= [a-z]) # matches when the next 2 chars are a space and any single lowercase letter
[1] A regex word boundary matches between a word char '\w' (by default [a-zA-Z0-9_])
and a non-word char '\W' ([^a-zA-Z0-9_]), or at the start/end of the string when the
first/last char is a word char. It is zero-width (just like '^' and '$').
Code demo on ideone
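As a rough illustration of how the pattern above could be used from PHP, here is a minimal sketch run against three of the sample titles from the question. The helper name extractTitle is made up for this example and is not part of the original code; it assumes you only want the first uppercase run in each title, so it uses preg_match rather than preg_match_all.

```php
<?php
// Hypothetical helper (not from the question): returns the first
// uppercase run in a title, or null when nothing matches.
function extractTitle(string $title): ?string
{
    // \b[A-Z]      start at a word boundary followed by an uppercase letter
    // .*?          take as few chars as possible...
    // (?= [a-z])   ...until a space followed by a lowercase letter comes next
    if (preg_match('/\b[A-Z].*?(?= [a-z])/', $title, $m) === 1) {
        return $m[0];
    }
    return null;
}

// Sample titles copied from the question.
$titles = [
    "wikiradio 27/09/2012 - LE QUATTRO GIORNATE DI NAPOLI raccontate da Ida Gribaudi",
    "wikiradio 11/02/2014 - L'ABBE' PIERRE raccontato da Giovanni Anversa",
    "wikiradio 22/12/2015 - JEAN-MICHEL BASQUIAT raccontato da Costantino D'Orazio",
];

foreach ($titles as $t) {
    echo extractTitle($t), PHP_EOL;
}
```

Because the match stops at the first space followed by a lowercase letter, this relies on the uppercase part always being followed by a lowercase word such as "raccontato".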
Update
An updated version that uses a whitelist of allowed chars (we can't be sure it covers every possible one):
(?m)\b[A-Z][A-Z '-]*(?= |$)
The online demo of the updated version
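For reference, a minimal sketch of the updated pattern in PHP, applied to several titles joined into one multi-line string. The function name extractUppercaseRuns is hypothetical, and the input is just the sample titles from the question. The (?m) flag is what lets $ match at the end of every line rather than only at the end of the whole string.

```php
<?php
// Hypothetical helper (not from the question): returns every run of
// whitelisted uppercase chars found in a possibly multi-line text.
function extractUppercaseRuns(string $text): array
{
    // \b[A-Z]       an uppercase letter at a word boundary
    // [A-Z '-]*     then only uppercase letters, spaces, apostrophes, dashes
    // (?= |$)       and the run must end before a space or at end of line
    // The apostrophe is escaped only because the PHP string is single-quoted.
    preg_match_all('/(?m)\b[A-Z][A-Z \'-]*(?= |$)/', $text, $m);
    return $m[0];
}

// Sample titles copied from the question, one per line.
$text = implode("\n", [
    "wikiradio 27/09/2012 - LE QUATTRO GIORNATE DI NAPOLI raccontate da Ida Gribaudi",
    "wikiradio 11/02/2014 - L'ABBE' PIERRE raccontato da Giovanni Anversa",
    "wikiradio 22/12/2015 - JEAN-MICHEL BASQUIAT raccontato da Costantino D'Orazio",
]);

print_r(extractUppercaseRuns($text));
```

Names such as "Giovanni" or "D'Orazio" are skipped because the char right after the uppercase part is neither a space nor the end of the line. A single capital letter standing alone before a space would still match, so the whitelist may need tuning for other inputs.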