c# asp.net html-parsing

How to parse a parameter and its value from HTML using IndexOf in C#


How can I programmatically retrieve a substring from an HTML string in C# using the IndexOf method? Here, Html holds the whole HTML content, and I want to retrieve the admission date value that appears in parseString. Currently this code returns the wrong content from the HTML. Could someone please identify the issue in my code?

protected string ParseAdmissionDate(string Html)
{
  string parseString = "<TD style=\"HEIGHT: 5.08mm; \" class=\"a355c\"><DIV class=\"a355\">AdmissionDate</DIV></TD><TD class=\"a359c\"><DIV class=\"a359\">3/8/2021</DIV></TD>";
  int i = 0;
  i = Html.IndexOf(parseString, 0, Html.Length);

  if (i > 0)
  {
    i += parseString.Length;
    int end = Html.IndexOf("</TD>", i, (Html.Length - i));

    return Html.Substring(i, end - i);
  }
  else
    return null;
}

Solution

  • You should consider using a library like HtmlAgilityPack to do web scraping.

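    If you really want to use IndexOf (for unknown reasons), remember that 0 is a valid result (it means the substring was found at index 0), so test for a negative result rather than i > 0. Note also that your parseString already contains the date and its closing tags, so after skipping parseString.Length the code reads whatever follows the date. Searching for the label only, the method would look something like this: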

    public static string ParseAdmissionDate(string Html)
    {
      //html contains approximately
      //<TD style=\"HEIGHT: 5.08mm; \" class=\"a355c\"><DIV class=\"a355\">AdmissionDate</DIV></TD><TD class=\"a359c\"><DIV class=\"a359\">3/8/2021</DIV></TD>
    
      //Find Div of the AdmissionDate
      var searchPattern = ">AdmissionDate</DIV>";
      var searchIndex = Html.IndexOf(searchPattern, StringComparison.InvariantCultureIgnoreCase);
      if(searchIndex < 0) return null;
    
      //Get the string that comes after searchPattern
      var stringAfterSearchPattern = Html.Substring(searchIndex + searchPattern.Length);
    
      //Find the next closing div after searchPattern
      var endIndex = stringAfterSearchPattern.IndexOf("</DIV>", StringComparison.InvariantCultureIgnoreCase);
      if(endIndex < 0) return null;
      
      //Index of the opening div
      var startValueIndex = stringAfterSearchPattern.Substring(0, endIndex).LastIndexOf(">");
      if(startValueIndex < 0) return null;
    
      return stringAfterSearchPattern.Substring(startValueIndex + 1, endIndex - startValueIndex - 1);
    }
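
To illustrate the point about IndexOf return values, here is a minimal sketch. The IndexOfDemo class and its Found helper are hypothetical names introduced only for this example; the calls themselves are the standard string.IndexOf overloads.

```csharp
using System;

public static class IndexOfDemo
{
    // Hypothetical helper for illustration: true when needle occurs anywhere in haystack.
    public static bool Found(string haystack, string needle)
    {
        // IndexOf returns -1 when there is no match; 0 means "match at the first character".
        return haystack.IndexOf(needle, StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        const string html = "<DIV>AdmissionDate</DIV>";

        Console.WriteLine(html.IndexOf("<DIV>", StringComparison.Ordinal));  // prints 0
        Console.WriteLine(html.IndexOf("<SPAN>", StringComparison.Ordinal)); // prints -1
        Console.WriteLine(Found(html, "<DIV>"));                             // prints True
    }
}
```

A check written as `i > 0` would treat the first case as "not found", which is why the method above compares against 0 with `< 0` instead.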
    

    Bear in mind that this approach is fragile: if the HTML changes even slightly, for example if AdmissionDate is no longer inside a div (something like "<td>AdmissionDate</td>"), the method will fail. That is why I recommend a web scraping library.
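
For comparison, here is a rough sketch of how the same lookup might be done with HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is installed and that the value always sits in the table cell immediately after the cell whose text is "AdmissionDate", as in the sample markup from the question. The class name AdmissionDateParser is made up for the example.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack", must be installed

public static class AdmissionDateParser
{
    // Returns the trimmed text of the cell that follows the "AdmissionDate" label cell,
    // or null when no such pair of cells is found.
    public static string ParseAdmissionDate(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // HtmlAgilityPack exposes element names in lower case, so the XPath uses "td"
        // even though the source markup is written as <TD>.
        // Assumption: the label cell's text is exactly "AdmissionDate".
        var valueCell = doc.DocumentNode.SelectSingleNode(
            "//td[normalize-space(.)='AdmissionDate']/following-sibling::td[1]");

        return valueCell?.InnerText.Trim();
    }
}
```

Because the XPath matches on the text of the label cell rather than on the exact tags inside it, this version should also cope with the "<td>AdmissionDate</td>" variant mentioned above, although it still depends on the label and value being adjacent cells.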