
Extract HTML content using YQL?


Let's say I want to extract data from a web page with the following markup:

<table>
  <tr>
    <td><a href="Link 1">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  <tr>
    <td><a href="Link 2">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  ...
</table>

and convert it to JSON in this format:

[
  {
    link: 'Link 1',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  },
  {
    link: 'Link 2',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  }
]

Can this be done with YQL? If so, please give me an example query.

Any help would be appreciated!


Solution

  • Here's a query that's a good starting point. It uses the html table together with an XPath expression (see Extracting HTML Content With XPath for more details on this technique):

    select * from html where url="http://cantoni.org/test/table.html" and xpath='//table/tr'

    This produces JSON results like the following:

    {
     "query": {
      "count": 2,
      "created": "2012-01-06T20:16:46Z",
      "lang": "en-US",
      "results": {
       "tr": [
        {
         "td": [
          {
           "a": {
            "href": "Link%201",
            "content": "Column 1 Text"
           }
          },
          {
           "p": "Column 2 Text"
          },
          {
           "p": "Column 3 Text"
          }
         ]
        },
        {
         "td": [
          {
           "a": {
            "href": "Link%202",
            "content": "Column 1 Text"
           }
          },
          {
           "p": "Column 2 Text"
          },
          {
           "p": "Column 3 Text"
          }
         ]
        }
       ]
      }
     }
    }
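
The query returns the table rows as nested tr/td objects rather than the flat link/text/data array asked for in the question, so the remaining step can be done in JavaScript on the client. The following is a minimal sketch, assuming the response has the shape shown above: the link sits in the first cell's a object, and the third cell's text sits under a p key (which reflects the test page's markup). The function name toRows and the sample variable are hypothetical, and the href is URL-decoded on the assumption that "Link%201" should become "Link 1".

```javascript
// Sketch only: reshape a YQL response like the one above into
// [{ link, text, data }, ...]. `toRows` is a hypothetical helper name.
function toRows(response) {
  const results = response && response.query && response.query.results;
  if (!results || !results.tr) return [];

  // Defensive: accept either an array of rows or a single row object.
  const rows = Array.isArray(results.tr) ? results.tr : [results.tr];

  return rows.map(function (tr) {
    const cells = [].concat(tr.td || []);
    const anchor = (cells[0] && cells[0].a) || {};
    const third = cells[2];

    // Assumption: the third cell is either { p: "..." } (as in the output
    // above) or a plain string when the cell has no wrapping element.
    let data = null;
    if (typeof third === 'string') data = third;
    else if (third && typeof third.p === 'string') data = third.p;

    return {
      // The href comes back URL-encoded ("Link%201"), so decode it.
      link: anchor.href ? decodeURIComponent(anchor.href) : null,
      text: anchor.content || null,
      data: data
    };
  });
}

// Usage with one row copied from the response above.
const sample = {
  query: {
    results: {
      tr: [
        {
          td: [
            { a: { href: 'Link%201', content: 'Column 1 Text' } },
            { p: 'Column 2 Text' },
            { p: 'Column 3 Text' }
          ]
        }
      ]
    }
  }
};

console.log(JSON.stringify(toRows(sample)));
// prints [{"link":"Link 1","text":"Column 1 Text","data":"Column 3 Text"}]
```

Missing cells produce null fields instead of throwing, which makes it easier to spot rows (such as header rows) that do not match the expected layout.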