
Extract data from a table with HtmlAgilityPack


I am learning HtmlAgilityPack and trying to get data from a page saved to disk. Specifically, there is a page 1.htm, and I want to get the value from the table cell next to the "Operating system" label (the document itself is attached). This is what I do:

private void simpleButton1_Click(object sender, EventArgs e)
        {
            // Create an instance of the class
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            // Load the file
            doc.Load(@"D:\(path to the file goes here)\1.htm");
            // Try to get information from the node, but the result is null
            HtmlAgilityPack.HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("//TD[@CLASS=pt]");
            ...

Ultimately I need to extract a lot of information from the file, but I think that once I can get one row, the rest can be done by analogy.

I managed to get the required value like this:

 private void simpleButton1_Click(object sender, EventArgs e)
        {
            // Create an instance of the class
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            // Load the file
            doc.Load(@"D:\(path to the file goes here)\1.htm");

            // Absolute position: 2nd table in body, 8th row, 4th cell
            foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.SelectNodes("//body/table[2]/tr[8]/td[4]"))
            {
                string stroka = node.InnerText;
            }
        }

But this is a brute-force approach: it only works as long as the structure of my document does not change. I have not yet figured out how to do the same thing by searching for the label.

File


Solution

  • This builds a dictionary of tables keyed by name. Each table is itself a dictionary in which the label column is the key and the value column is the value (the code takes them from the last two columns).

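    // Requires: using System.Collections.Generic; using System.Linq;
    //           using System.Text; using HtmlAgilityPack;
    var tables = new Dictionary<string, Dictionary<string, string>>();
    var doc = new HtmlDocument();
    // Read the file as Windows-1251. On .NET Core / .NET 5+ this code page is only
    // available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    doc.Load(@"D:\(path to the file goes here)\1.htm", Encoding.GetEncoding(1251), false);
    // Each section heading is a <td class="pt"> cell containing a named anchor
    var tableNames = doc.DocumentNode.SelectNodes("//td[@class='pt']/a").Select(a => a.Attributes["name"].Value);
    foreach (string name in tableNames)
    {
        // The data table is the first table that follows the table holding the heading
        var table = doc.DocumentNode.SelectSingleNode("//table[.//a[@name='" + name + "']]/following-sibling::table[1]");
        int columns = table.SelectNodes(".//tr[1]/td").Count;

        // Keys come from the second-to-last column, values from the last one
        string[] keys = table.SelectNodes(".//tr/td[" + (columns - 1) + "]").Select(n => n.InnerText.Replace("&nbsp;", " ").Trim()).ToArray();
        string[] values = table.SelectNodes(".//tr/td[" + columns + "]").Select(n => n.InnerText.Replace("&nbsp;", " ").Trim()).ToArray();
        var body = new Dictionary<string, string>();
        for (int i = 0; i < keys.Length; i++)
        {
            string key = keys[i];
            // A repeated key gets its values joined with commas;
            // otherwise rows with an empty key or an empty value are skipped
            if (body.ContainsKey(key))
                body[key] += ", " + values[i];
            else if (key != "" && values[i] != "")
                body[key] = values[i];
        }
        tables.Add(name, body);
    }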

    For example, tables["power management"] returns 4 entries:

    • [0] {[Текущий источник питания, Электросеть]} System.Collections.Generic.KeyValuePair
    • [1] {[Состояние батарей, Нет батареи]} System.Collections.Generic.KeyValuePair
    • [2] {[Полное время работы от батарей, Неизвестно]} System.Collections.Generic.KeyValuePair
    • [3] {[Оставшееся время работы от батарей, Неизвестно]} System.Collections.Generic.KeyValuePair

    and tables["power management"]["Текущий источник питания"] returns:

    "Электросеть"

    To iterate over all tables and their entries, you can do:

    foreach(var tableName in tables.Keys)
    {
        var table = tables[tableName];
        foreach(var key in table.Keys)
        {
            string value = table[key];
            Debug.Print(tableName + "/" + key + "/" + value);
        }
    }
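
For the single value the question asks about, here is a minimal sketch of a lookup by label text instead of by position. It also shows why the XPath from the question returns null: the unquoted `pt` is read as a child element named "pt" rather than as the string 'pt', and, as far as I know, HtmlAgilityPack exposes element and attribute names to XPath in lower case, so `td` and `@class` are the safer spelling. The markup in `SampleHtml`, the class `LabelLookupSketch` and the helper `FindValue` are made up for illustration; the real 1.htm may be laid out differently. The sketch assumes the label sits in its own cell, the value is in the next cell of the same row, and the label contains no apostrophe.

```csharp
using System;
using HtmlAgilityPack;

public static class LabelLookupSketch
{
    // Hypothetical sample markup, NOT the real 1.htm from the question.
    public const string SampleHtml = @"
<html><body>
<table><tr><td class='pt'><a name='summary'>Summary</a></td></tr></table>
<table>
  <tr><td></td><td></td><td>Computer type</td><td>ACPI x64-based PC</td></tr>
  <tr><td></td><td></td><td>Operating system</td><td>Microsoft Windows 7</td></tr>
</table>
</body></html>";

    // Returns the text of the cell that follows the first cell whose text
    // contains `label`, or null when no such cell exists.
    // Assumption: `label` contains no apostrophe, since it is pasted into the XPath.
    public static string FindValue(HtmlDocument doc, string label)
    {
        string xpath = "//td[contains(text(), '" + label + "')]/following-sibling::td[1]";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        // InnerText keeps entities such as &nbsp;, so decode them before trimming.
        return cell == null ? null : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        // XPath from the question: the unquoted pt never matches, so this is null.
        HtmlNode original = doc.DocumentNode.SelectSingleNode("//TD[@CLASS=pt]");
        Console.WriteLine(original == null);

        // Lower-case names and a quoted string literal find the heading cell.
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//td[@class='pt']");
        Console.WriteLine(heading.InnerText);

        // Lookup by label, independent of the row and table position.
        Console.WriteLine(FindValue(doc, "Operating system"));
    }
}
```

Compared with `//body/table[2]/tr[8]/td[4]`, this keeps working if rows or tables are added before the one you need, but it breaks if the label text itself changes or appears in more than one table.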