regex with uppercase words and dash


Given text strings such as these:

wikiradio 27/09/2012 - LE QUATTRO GIORNATE DI NAPOLI raccontate da Ida Gribaudi

wikiradio 10/04/2013 - DAG HAMMARSKJOLD raccontato da Susanna Pesenti

I am working on a regular expression that matches only the UPPERCASE WORDS of these strings (i.e. "LE QUATTRO GIORNATE DI NAPOLI" and "DAG HAMMARSKJOLD"). This is my code:

$title = $_GET["title"];
if (preg_match_all('/\\b(?=[A-Z])[A-Z\' ]+(?=\\W)/', $title, $match)) {
    // process matched portion...
}

It works most of the time, but it fails when the $title string contains an apostrophe followed by a space, or a dash. For example, the uppercase words in these two titles are not matched:

wikiradio 11/02/2014 - L'ABBE' PIERRE raccontato da Giovanni Anversa

wikiradio 22/12/2015 - JEAN-MICHEL BASQUIAT raccontato da Costantino D'Orazio

What am I missing?


Solution

  • Something like this may work for you:

    \b[A-Z].*?(?= [a-z])
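
As a minimal sketch of how this pattern could be used from PHP, the snippet below runs it against the four titles from the question with `preg_match`, which stops at the first match. The variable names (`$titles`, `$pattern`, `$found`) are illustrative and not taken from the question's code. Note that the pattern relies on the uppercase part being followed by a space and a lowercase word.

```php
<?php
// The four sample titles from the question.
$titles = [
    "wikiradio 27/09/2012 - LE QUATTRO GIORNATE DI NAPOLI raccontate da Ida Gribaudi",
    "wikiradio 10/04/2013 - DAG HAMMARSKJOLD raccontato da Susanna Pesenti",
    "wikiradio 11/02/2014 - L'ABBE' PIERRE raccontato da Giovanni Anversa",
    "wikiradio 22/12/2015 - JEAN-MICHEL BASQUIAT raccontato da Costantino D'Orazio",
];

// Single-quoted PHP string: \b reaches the regex engine unchanged.
$pattern = '/\b[A-Z].*?(?= [a-z])/';

$found = [];
foreach ($titles as $title) {
    // preg_match returns 1 on the first match; $match[0] holds the matched text.
    if (preg_match($pattern, $title, $match)) {
        $found[] = $match[0];
    }
}

print_r($found);
```

Because the quantifier is lazy, the match ends at the first place where a space and a lowercase letter follow, so the apostrophes and the dash need no special handling.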
    

    Regex online demo

    Legend

        \b         # regex word boundary [1]
        [A-Z]      # any single uppercase letter
        .*?        # any character except a newline, repeated zero or more times, lazily
        (?= [a-z]) # lookahead: succeeds when the next 2 chars are a space and a single lowercase letter
    
    [1] A regex word boundary matches between a word char '\w' (i.e. [a-zA-Z0-9_])
        and a non-word char '\W' (i.e. [^a-zA-Z0-9_]). It also matches at the start
        or end of the string when the adjacent char is a word char.
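
To make the word-boundary note more concrete, here is a small sketch that splits one of the titles' uppercase parts on `\b`. The variable name `$pieces` is made up for this example. Since `\b` is zero-width, the split only shows where the boundaries fall.

```php
<?php
// \b is zero-width: splitting on it cuts the string wherever a word
// character (\w) meets a non-word character (\W).
// PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.
$pieces = preg_split('/\b/', 'JEAN-MICHEL BASQUIAT', -1, PREG_SPLIT_NO_EMPTY);

print_r($pieces);
```

The dash and the space each come out as their own piece, which illustrates that a dash is a non-word character and therefore sits between two word boundaries.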
    

    Code demo on ideone


    Update

    An updated version that uses a whitelist of allowed chars (we can't be sure it covers every possible one):

    (?m)\b[A-Z][A-Z '-]*(?= |$)
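
A hedged sketch of the whitelist version in PHP follows. The helper name `extractUppercaseTitle` is hypothetical, introduced only for this example. Inside a single-quoted PHP string the apostrophe in the character class has to be written as `\'`. The `(?m)` flag makes `$` match at the end of each line as well as at the end of the string.

```php
<?php
// Hypothetical helper: returns the first uppercase run in $title, or null.
function extractUppercaseTitle(string $title): ?string
{
    // Whitelist: uppercase letters, space, apostrophe and dash.
    // The dash is last in the class, so it is a literal dash, not a range.
    $pattern = '/(?m)\b[A-Z][A-Z \'-]*(?= |$)/';

    if (preg_match($pattern, $title, $match)) {
        return $match[0];
    }
    return null;
}

$abbe = extractUppercaseTitle(
    "wikiradio 11/02/2014 - L'ABBE' PIERRE raccontato da Giovanni Anversa"
);
$basquiat = extractUppercaseTitle(
    "wikiradio 22/12/2015 - JEAN-MICHEL BASQUIAT raccontato da Costantino D'Orazio"
);

echo $abbe, "\n", $basquiat, "\n";
```

Capitalized names such as "Giovanni" or "D'Orazio" are not picked up here, because the character right after the uppercase run must be a space or the end of the line.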
    

    The online demo of the updated version