I am forcing myself to learn how to script solely in AppleScript, but I am currently having trouble removing a particular tag with a given class. I've tried to find solid documentation and examples, but they seem to be very limited.
Here is the HTML I have:
<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class="foo">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami <span class="foo">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>
What I am trying to do is remove the tags for a particular class, so it would remove <span class="foo"> and its closing tag but keep the text inside, with this result:
<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl shoulder biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami jerky strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>
I know how to do this with do shell script and through the terminal, but I want to learn what is available through AppleScript's dictionary.
In my research I was able to find a way to strip all HTML tags with:
on removeMarkupFromText(theText)
    set tagDetected to false
    set theCleanText to ""
    repeat with a from 1 to length of theText
        set theCurrentCharacter to character a of theText
        if theCurrentCharacter is "<" then
            set tagDetected to true
        else if theCurrentCharacter is ">" then
            set tagDetected to false
        else if tagDetected is false then
            set theCleanText to theCleanText & theCurrentCharacter as string
        end if
    end repeat
    return theCleanText
end removeMarkupFromText
but that removes all HTML tags, which is not what I want. Searching SO, I was able to find how to extract the text between tags in "Parsing HTML source code using AppleScript", but I'm not looking to parse the file.
I am familiar with BBEdit's Balance Tags, known as Balance in the drop-down, but when I run:
tell application "BBEdit"
    activate
    find "<span class=\"foo\">" searching in text 1 of text document "test.html" options {search mode:grep, wrap around:true} with selecting match
    balance tags
end tell
it turns greedy and grabs everything from the first tag to the second-to-last closing tag, including the text in between, instead of limiting itself to the first tag and its text.
Researching further in the dictionary under tag, I ran across find tag, which I could use like this: set spanTarget to (find tag "span" start_offset counter), then target the tag with the class through |class| of attributes of tag of spanTarget and use balance tags, but I am still running into the same issue as before.
So in pure AppleScript, how can I remove a tag associated with a class without it being greedy?
I believe Ron's answer is a good approach, but if you don't want to use regular expressions this can be achieved with the code below. I wasn't going to post it after seeing Ron had answered, but I had already created it so I figured I would at least give you a second option since you are trying to learn.
on run
    set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class=\"foo\">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class=\"bar\">Pig brisket</span> jowl ham pastrami <span class=\"foo\">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>"
    set theHTML to removeTag(theHTML, "<span class=\"foo\">", "</span>")
end run

on removeTag(theText, startTag, endTag)
    if theText contains startTag then
        -- Split the text on the opening tag.
        set AppleScript's text item delimiters to startTag
        set tempText to text items of (theText as string)
        set AppleScript's text item delimiters to {""}
        -- Item 2 is the text that follows the first opening tag.
        set middleText to item 2 of tempText as string
        if middleText contains endTag then
            -- Drop only the first closing tag after that opening tag.
            set AppleScript's text item delimiters to endTag
            set tempText2 to text items of (middleText as string)
            set AppleScript's text item delimiters to {""}
            set newString to implode(tempText2, endTag)
            set item 2 of tempText to newString
        end if
        -- Rejoin without the first opening tag, then process the remaining ones.
        set newString to implode(tempText, startTag)
        return removeTag(newString, startTag, endTag) -- recursive
    else
        return theText
    end if
end removeTag

-- Joins items 1 and 2 directly (dropping the first tag) and puts the tag back between the remaining items.
on implode(parts, tag)
    set newString to (items 1 thru 2 of parts) as string
    if (count of parts) > 2 then
        set newList to {newString} & (items 3 thru -1 of parts)
        set AppleScript's text item delimiters to tag
        set newString to (newList as string)
        set AppleScript's text item delimiters to {""}
    end if
    return newString
end implode
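If the recursion is a concern for long documents, the same idea can be sketched as a single loop. This is only an illustration: stripTagPairs is a hypothetical handler name that does not appear in the code above. It assumes the spans being removed are not nested inside one another, and it relies on the offset command from Standard Additions.

```applescript
-- Hypothetical iterative variant of removeTag; assumes the target tags are not nested.
on stripTagPairs(theText, startTag, endTag)
    set savedDelimiters to AppleScript's text item delimiters
    -- Split on the opening tag; item 1 is the text before the first match.
    set AppleScript's text item delimiters to startTag
    set chunks to text items of theText
    set cleanText to item 1 of chunks
    repeat with i from 2 to (count of chunks)
        set chunk to item i of chunks
        -- Drop only the first closing tag that follows this opening tag.
        set cutPoint to offset of endTag in chunk
        if cutPoint > 0 then
            set beforeClose to ""
            if cutPoint > 1 then set beforeClose to text 1 thru (cutPoint - 1) of chunk
            set afterClose to ""
            set restStart to cutPoint + (length of endTag)
            if restStart <= (length of chunk) then set afterClose to text restStart thru -1 of chunk
            set chunk to beforeClose & afterClose
        end if
        set cleanText to cleanText & chunk
    end repeat
    -- Restore whatever delimiters the caller had set.
    set AppleScript's text item delimiters to savedDelimiters
    return cleanText
end stripTagPairs
```

It would be called the same way as removeTag, for example stripTagPairs(theHTML, "<span class=\"foo\">", "</span>"). Other spans, such as the one with class "bar", are left alone because the text is only ever split on the exact opening tag that was passed in.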