html parsing applescript delimiter automator

Parsing HTML source code using AppleScript


I'm trying to parse an HTML file that I converted to a TXT file in Automator.

I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse the source code.

Ideally, I want to extract only the contents of the table, and I need to repeat this for 1,800 different HTML files.

Here is an example of the source code:

</head>
<body>
<div id="header">
    <div class="wrapper">
        <span class="access">
        <div id="fb-root"></div>


    <span class="access">
     Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

    </span>
                                    </span>
    </div><!-- /wrapper -->
</div><!-- /header -->

<div id="masthead">
    <div class="wrapper">   
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
        <div id="navigation">
            <ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>    <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>               
        </div><!-- /navigation -->

    </div><!-- /wrapper -->     
</div><!-- /masthead -->


<div id="content">
    <div class="wrapper">
        <div id="main-content">

 <!-- per Project stuff -->
    <span class="section">
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                                    <ul class="gbutton-group right">
                    <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752"  id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
                </ul>

                <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
                <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
                <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
                </div>
                                    <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>

            </span>

            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                                                        <tr>
                    <th>Role</th>
                    <td>
                    <p>Other</p>                            </td>
                </tr>
                <tr>  
                    <th>Organisation Type</th>
                    <td>
                    <p>Asset Manager</p>                        </td>
                </tr>
                <tr>
                    <th>Email</th>
                    <td><a href="mailto:[email protected]" title="[email protected]" >[email protected]</a></td>
                </tr>
                <tr>
                    <th>Website</th>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                </tr>
                <tr>
                    <th>Phone</th>
                    <td>41 78 616 7334</td>
                </tr>
                <tr>
                    <th>Fax</th>
                    <td></td> 
                </tr>
                <tr>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                </tr>
                <tr>
                    <th>City</th>
                    <td>Schindellegi</td>
                </tr>
                <tr>
                    <th>State</th>
                    <td>CH</td>
                </tr>
                <tr>
                    <th>Country</th>
                    <td>Switzerland</td>
                </tr>
                <tr>
                    <th class="lastrow" >Zip/ Postal Code</th>
                    <td class="lastrow" >8834</td>
                </tr>
        </table>
                </div><!-- /main-content -->
                    <div id="sidebar"  >
                    </div>

            <div id="similar_sidebar" class="similar_refine" >



            </div>
                            </div><!-- /wrapper -->
</div><!-- /content -->

<div id="footer">

</div>

Here is my AppleScript attempt, which uses text item delimiters to extract the table:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- try to extract the table

to extractBetween(SearchText, startText, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText
    set endItems to text of text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text of text item 1 of endItems
    set AppleScript's text item delimiters to tid
    return beginningToEnd
end extractBetween

How can I parse the table from the HTML file?


Solution

  • You're really close. The problem is your startText value: the literal string "<table>" never appears in the HTML, so that delimiter is never found. The line that opens the table is actually...

    <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
    

    So I modified your code to look for that tag in 2 steps. First...

    <table
    

    And then this separately...

    >
    

    This way we can ignore the attributes inside the table tag (width, border, etc.), which I assume vary between the files. What remains is only the contents of the table. Try this...

    set p to input
    set ex to extractBetween(p, "<table", ">", "</table>")
    
    to extractBetween(SearchText, startText1, startText2, endText)
        set tid to AppleScript's text item delimiters
        set AppleScript's text item delimiters to startText1
        set endItems to text item -1 of SearchText
        set AppleScript's text item delimiters to endText
        set beginningToEnd to text item 1 of endItems
        set AppleScript's text item delimiters to startText2
        set finalText to (text items 2 thru -1 of beginningToEnd) as text
        set AppleScript's text item delimiters to tid
        return finalText
    end extractBetween
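
    To see the difference between the two searches, here is a minimal sketch that runs the handler above on a short string. The sampleHTML value is a made-up stand-in for one downloaded page, not the real file, and the script is meant to be pasted into Script Editor rather than Automator.

    ```applescript
    -- Hypothetical sample standing in for one downloaded page
    set sampleHTML to "<body><table width=\"100%\" id=\"profile-table\"><tr><th>City</th><td>Schindellegi</td></tr></table></body>"

    -- The literal "<table>" never occurs, so the text is not split at all
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to "<table>"
    set pieceCount to count of text items of sampleHTML
    set AppleScript's text item delimiters to tid

    -- The two-step search skips the attributes and keeps only the table contents
    set tableBody to extractBetween(sampleHTML, "<table", ">", "</table>")
    return {pieceCount, tableBody}
    -- result: {1, "<tr><th>City</th><td>Schindellegi</td></tr>"}

    to extractBetween(SearchText, startText1, startText2, endText)
        set tid to AppleScript's text item delimiters
        -- keep everything after the last "<table"
        set AppleScript's text item delimiters to startText1
        set endItems to text item -1 of SearchText
        -- cut off everything from "</table>" onward
        set AppleScript's text item delimiters to endText
        set beginningToEnd to text item 1 of endItems
        -- drop the attributes up to the first ">" and rejoin the rest with ">"
        set AppleScript's text item delimiters to startText2
        set finalText to (text items 2 thru -1 of beginningToEnd) as text
        -- restore the previous delimiters
        set AppleScript's text item delimiters to tid
        return finalText
    end extractBetween
    ```

    Two limits follow from how the handler splits the text: because it takes text item -1, it returns the last table when a page contains more than one, and a ">" character inside one of the table tag's attribute values would cut the attributes off at the wrong place.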