Search code examples
javascriptc#azure-functionsazure-logic-apps

Create Azure Function to Convert HTML Table to JSON


Two years ago I asked the question

How do I convert an HTML table into JSON in Logic Apps

@Skin provided the answer and it continues to work flawlessly.

Recently however, I started a project to extract table data from some webpages. I created a new function and copied the code from the from the first function. The code that @Skin provided was looking for span elements with an attribute of class="value" in order to identify and spit the data.

I was able to modify the code to work with one of the tables I need to convert by removing one of the variables and changing the line var values = xmlRow.SelectNodes(".//span[@class]"); to var values = xmlRow.SelectNodes(".//td");.

This worked for the table since it only had two values for each data set, Name and Start Date. Using a Compose Action I reworked the table to be very simple:

<table>
   <tbody>
      <tr>
         <td>John Dodo 1</td>
         <td>Some Date 1</td>
      </tr>
      <tr>
         <td>John Dodo 2</td>
         <td>Some Date 2</td>
      </tr>
      <tr>
         <td>John Dodo 3</td>
         <td>Some Date 3</td>
      </tr>
   </tbody>
</table>

The returned JSON is

  {
    "name": "John Dodo 1",
    "startDate": "Some Date 1"
  },
  {
    "name": "John Dodo 2",
    "startDate": "Some Date 2"
  },
  {
    "name": "John Dodo 3",
    "startDate": "Some Date 3"
  }
]

However, When a table has three values like Name, Company, Address, and I add the appropriate variable to the code, I get a server error and I can't figure out why. I could try and modify the table to add span elements class="value" attributes to match the table that @Skin helped with, but I would like to understand why its won't work.

Here is the modified code that won't work:

#r "Newtonsoft.Json"

using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using System.Collections.Generic;
using Newtonsoft.Json;
using System.Xml;

public class TableData
{
    public string PmRep { get; set; }
    public string PmComp { get; set; }
    public string PmAddr { get; set; }
}

public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
    var outputTable = new List<TableData>();

    string requestBody = String.Empty;

    using (StreamReader streamReader =  new  StreamReader(req.Body))
    {
        requestBody = await streamReader.ReadToEndAsync();
    }

    dynamic data = JsonConvert.DeserializeObject(requestBody);
    string xmlString = System.Text.Encoding.UTF8.GetString(Convert.FromBase64String((string)data?.Content));;

    var xmlDocument = new XmlDocument();
    xmlDocument.LoadXml(xmlString);

    var xmlRows = xmlDocument.DocumentElement.SelectNodes("//tr");

    foreach (XmlNode xmlRow in xmlRows)
    {                
        var values = xmlRow.SelectNodes(".//td");

        if (values.Count > 0)
        {
            var row = new TableData()
            {
                PmRep = values[0].InnerText,
                PmComp = values[1].InnerText
                PmAddr = values[2].InnerText
            };

            outputTable.Add(row);
        }
    }

    return new OkObjectResult(outputTable);
}

and here is the sample table:

<table>
   <tbody>
      <tr>
         <td>Name 1</td>
         <td>Comp 1</td>
         <td>Address 1</td>
      </tr>
      <tr>
         <td>Name 2</td>
         <td>Comp 2</td>
         <td>Address 2</td>
      </tr>
      <tr>
         <td>Name 3</td>
         <td>Comp 3</td>
         <td>Address 3</td>
      </tr>
   </tbody>
</table>

Solution

  • You can simply use Azure Logic Apps itself as an alternative to azure function and, by using integration account and inline java script action:

    Input:

    <table>
       <tbody>
          <tr>
             <td>Name</td>
             <td>Company</td>
             <td>Address</td>
          </tr>
          <tr>
             <td>Riithwik</td>
             <td>TextCom</td>
             <td>Address x</td>
          </tr>
          <tr>
             <td>Bojja</td>
             <td>Test Com 2</td>
             <td>Address y</td>
          </tr>
          <tr>
             <td>Chotu</td>
             <td>Test Com 3</td>
             <td>Address z</td>
          </tr>
       </tbody>
    </table>
    

    Design:

    enter image description here

    Then:

    enter image description here

    Inline Script:

    var rithinput = workflowContext.actions.Compose.outputs;
    var rithtest = new RegExp("<tr>(.*?)</tr>", "gs");
    var rithm = rithinput.match(rithtest);
    var rithout = [];
    var headerRowMatches = rithm[0].match(/<td>(.*?)<\/td>/gs);
    var columnHeaders = headerRowMatches.map(cell => cell.replace(/<\/?td>/g, '').trim());
    for (var i = 1; i < rithm.length; i++) {
        var row = rithm[i];
        var rb = row.match(/<td>(.*?)<\/td>/gs);
        var rowData1 = {};
        for (var j = 0; j < rb.length; j++) {
             var cellData1 = rb[j].replace(/<\/?td>/g, '').trim();
             rowData1[columnHeaders[j]] = cellData1;
            }
            rithout.push(rowData1);
       }
       return rithout;
    

    Output:

    enter image description here

    enter image description here

    This design works with 2 or 3 or n columns in a row, first row should be column names in input.