regex, extract, phrases

How do I extract multiple instances of alt text with a regex?


I was using (?<=alt)[\w\s\,\/\(\)\.]* to extract the first alt text. That works for the first one, but there are multiple alt texts that I would like to extract. I am using the regex inside Visual Web Ripper.

The HTML I am extracting from is:

<DIV id=ctl00_ContentRightColumn_CustomFunctionalityFieldControl1_ctl00_ctl00_woodFeatures class="woodFeaturesPanel woodFeaturesPanelSingle" sizcache="23614" sizset="0"><H2>Features:</H2>
<DIV sizcache="23614" sizset="0">
<UL sizcache="23614" sizset="0">
<LI sizcache="23386" sizset="0"><IMG alt="Information board at site" src="/PublishingImages/icon_infoboard.gif">
<LI sizcache="20558" sizset="0"><IMG alt="Parking nearby" src="/PublishingImages/icon_carparknear.gif">
<LI sizcache="23614" sizset="0"><IMG alt=Grassland src="/PublishingImages/icon_grassland.giF">
<LI sizcache="17694" sizset="0"><IMG alt="Is woodland creation site" src="/PublishingImages/icon_woodlandcreation.gif">
<LI sizcache="21680" sizset="0"><IMG alt="Mainly broadleaved woodland" src="/PublishingImages/icon_mainlybroadleaved.gif">
<LI sizcache="20704" sizset="0"><IMG alt="Mainly young woodland" src="/PublishingImages/icon_mainlyyoung.gif">
<LI>
<LI></LI></UL></DIV></DIV>

Solution

  • Without knowing which language or regex engine you are using, it is difficult to say for certain, but with capturing groups (memory patterns) you can capture what you need:

    /alt=(\w\S*|"([^"]*)")/
    

    With PHP's preg_match_all() it gives the following results:

    Array
    (
        [0] => Array
            (
                [0] => alt="Information board at site"
                [1] => alt="Parking nearby"
                [2] => alt=Grassland
                [3] => alt="Is woodland creation site"
                [4] => alt="Mainly broadleaved woodland"
                [5] => alt="Mainly young woodland"
            )
    
        [1] => Array
            (
                [0] => "Information board at site"
                [1] => "Parking nearby"
                [2] => Grassland
                [3] => "Is woodland creation site"
                [4] => "Mainly broadleaved woodland"
                [5] => "Mainly young woodland"
            )
    
        [2] => Array
            (
                [0] => Information board at site
                [1] => Parking nearby
                [2] =>
                [3] => Is woodland creation site
                [4] => Mainly broadleaved woodland
                [5] => Mainly young woodland
            )
    
    )
    

    The second capturing group holds the text of double-quoted values without the quotes. When it is empty, as it is for the unquoted alt=Grassland, use the first capturing group instead.
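
The fallback between the two groups can be sketched as follows. This is only an illustration in Python's re module, chosen because the engine in use is not stated. The names ALT_PATTERN, extract_alt_texts and sample are made up for the example, and the sample string is a shortened version of the HTML from the question.

```python
import re

# Pattern from the answer above, written without the PHP-style / delimiters.
ALT_PATTERN = re.compile(r'alt=(\w\S*|"([^"]*)")')


def extract_alt_texts(html):
    """Return every alt value found in html, without surrounding quotes."""
    texts = []
    for match in ALT_PATTERN.finditer(html):
        # Group 2 takes part in the match only for double-quoted values;
        # for unquoted values it is None, so fall back to group 1.
        quoted = match.group(2)
        texts.append(quoted if quoted is not None else match.group(1))
    return texts


# Shortened sample with one quoted and one unquoted alt attribute.
sample = (
    '<LI><IMG alt="Information board at site" src="/PublishingImages/icon_infoboard.gif">'
    '<LI><IMG alt=Grassland src="/PublishingImages/icon_grassland.giF">'
)

print(extract_alt_texts(sample))
# prints ['Information board at site', 'Grassland']
```

Note that the unquoted branch (\w\S*) runs up to the next whitespace, so it relies on the unquoted value being followed by a space, as it is in the HTML above.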