Tags: html, css, applescript, bbedit

Remove HTML tag associated with a class


I am teaching myself to script solely in AppleScript, but I am stuck on removing a particular tag that carries a specific class. I have looked for solid documentation and examples, but what I have found so far is very limited.

Here is the HTML I have:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class="foo">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami <span class="foo">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

What I am trying to do is remove the tags that carry a particular class while keeping their text. For example, removing <span class="foo"> (and its closing </span>) should give this result:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl shoulder biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami jerky strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

I know how to do this with do shell script and through the terminal, but I want to learn what is available through AppleScript's dictionary.

In my research I found a way to strip all HTML tags:

on removeMarkupFromText(theText)
    set tagDetected to false
    set theCleanText to ""
    repeat with a from 1 to length of theText
        set theCurrentCharacter to character a of theText
        if theCurrentCharacter is "<" then
            set tagDetected to true
        else if theCurrentCharacter is ">" then
            set tagDetected to false
        else if tagDetected is false then
            set theCleanText to theCleanText & theCurrentCharacter as string
        end if
    end repeat
    return theCleanText
end removeMarkupFromText

but that removes every HTML tag, which is not what I want. Searching SO, I found how to extract the text between tags in "Parsing HTML source code using AppleScript", but I'm not looking to parse the file.

I am familiar with BBEdit's Balance Tags command (shown as Balance in the drop-down menu), but when I run:

tell application "BBEdit"
    activate
    find "<span class=\"foo\">" searching in text 1 of text document "test.html" options {search mode:grep, wrap around:true} with selecting match
    balance tags
end tell

it turns greedy: it selects everything from the first tag through the second-to-last closing tag, text included, instead of limiting itself to the first tag and its text.

Looking further in the dictionary under tag, I came across find tag, which lets me do: set spanTarget to (find tag "span" start_offset counter). I can then check the class with |class| of attributes of tag of spanTarget and call balance tags, but I still run into the same issue as before.

So, in pure AppleScript, how can I remove a tag associated with a class without the match being greedy?


Solution

  • I believe Ron's answer is a good approach, but if you don't want to use regular expressions, this can be achieved with the code below. I wasn't going to post it after seeing that Ron had answered, but I had already written it, so I figured I would at least give you a second option since you are trying to learn.

    on run
        set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class=\"foo\">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class=\"bar\">Pig brisket</span> jowl ham pastrami <span class=\"foo\">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>" 
        set theHTML to removeTag(theHTML, "<span class=\"foo\">", "</span>")
    end run
    
    on removeTag(theText, startTag, endTag)
        if theText contains startTag then
            -- Split on the opening tag; item 2 begins right after its first occurrence.
            set AppleScript's text item delimiters to startTag
            set tempText to text items of (theText as string)
            set AppleScript's text item delimiters to {""}
    
            set middleText to item 2 of tempText as string
            if middleText contains endTag then
                -- Split on the closing tag and rejoin, dropping only the first one.
                set AppleScript's text item delimiters to endTag
                set tempText2 to text items of (middleText as string)
                set AppleScript's text item delimiters to {""}
                set newString to implode(tempText2, endTag)
                set item 2 of tempText to newString
            end if
            -- Rejoin, dropping the first opening tag, then process what is left.
            set newString to implode(tempText, startTag)
            return removeTag(newString, startTag, endTag) -- recursive
        else
            return theText
        end if
    end removeTag
    
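    -- Joins parts back together: the first two items are joined directly
    -- (which drops one tag), and tag is re-inserted between the remaining items.
    on implode(parts, tag)
        set AppleScript's text item delimiters to {""}
        set newString to items 1 thru 2 of parts as string
        if (count of parts) > 2 then
            -- Build a flat list so the coercion to string is unambiguous.
            set newList to {newString} & (items 3 thru -1 of parts)
            set AppleScript's text item delimiters to tag
            set newString to (newList as string)
            set AppleScript's text item delimiters to {""}
        end if
        return newString
    end implode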
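
For comparison, the same idea can probably be written without recursion or text item delimiters by walking the string with the Standard Additions offset command. This is only a sketch and not part of the answer above: the handler name unwrapTag and the sample string are made up for illustration, and it assumes the target spans are not nested and that the first closing tag after each opening tag belongs to it.

```applescript
-- Sample call on a shortened input (hypothetical data, not the asker's file).
set sampleHTML to "a <span class=\"foo\">b</span> c <span class=\"bar\">d</span> e"
set cleaned to unwrapTag(sampleHTML, "<span class=\"foo\">", "</span>")
-- cleaned is now: a b c <span class="bar">d</span> e

-- Hypothetical helper: removes each startTag and the first endTag that
-- follows it, keeping the text in between. Assumes no nested tags.
on unwrapTag(theText, startTag, endTag)
    set resultText to ""
    set restText to theText
    repeat while restText contains startTag
        -- Copy everything that precedes the opening tag.
        set openAt to offset of startTag in restText
        if openAt > 1 then set resultText to resultText & (text 1 thru (openAt - 1) of restText)
        -- Skip over the opening tag itself.
        set afterOpen to openAt + (length of startTag)
        if afterOpen > (length of restText) then
            set restText to ""
        else
            set restText to text afterOpen thru -1 of restText
        end if
        -- Keep the text before the next closing tag, then skip that tag.
        set closeAt to offset of endTag in restText
        if closeAt > 0 then
            if closeAt > 1 then set resultText to resultText & (text 1 thru (closeAt - 1) of restText)
            set afterClose to closeAt + (length of endTag)
            if afterClose > (length of restText) then
                set restText to ""
            else
                set restText to text afterClose thru -1 of restText
            end if
        end if
    end repeat
    -- Whatever is left contains no more opening tags.
    return resultText & restText
end unwrapTag
```

Because each pass consumes the text up to and including one closing tag, the loop always terminates, and other spans such as <span class="bar"> are left alone as long as they do not sit inside a foo span.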