How to get values from website using HTMLNode and HtmlAgility-Pack in C#


I'm trying to get data from this website

I want to get the Level, Vocation and Name from the table. They are located directly in the td cells of the tr rows that have a class (tr class -> td). How can I extract this information? This is what the data looks like:

<table width="100%" class="tabi">
  <tr>
    <td colspan=7>
      Characters
    </td>
  </tr>

  <tr>
    <td height='30' style='background-color:#9f8f6d;'>
      <a href=?page=whoisonline&ord=name&sort=DESC&id=1>&#8593;Name</a>
    </td>
    <td width='240' style='background-color:#9f8f6d;'>
      <a href=?page=whoisonline&ord=voc&sort=DESC&id=1>Vocation</a>
    </td>
    <td width='120' style='background-color:#9f8f6d;'>
      <a href=?page=whoisonline&ord=lvl&sort=DESC&id=1>Level</a>
    </td>
  </tr>

  <tr class='hover'> 
    <td>
      <a href='?page=character&name=Abe' class='menulink_hs'>Abe</a>
    </td>
    <td>
      Elder Druid
    </td>
    <td>
      19
    </td>
  </tr>

Right now I'm stuck on getting this data out of the td cells using nodes, with bad results. My htmlNodes is either null or contains more than one node (which I can't actually get out of it for some reason). What would be a good solution to this?

My code:

var html = @"https://tibiantis.online/?page=whoisonline";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);

HtmlNode htmlNodes = htmlDoc.DocumentNode.SelectSingleNode("/html/body/div[2]/table/tbody/tr[1]/td[3]/div[2]/div[2]/table/tbody/tr[3]");
foreach (var node in htmlNodes)
{
    foreach (var cell in htmlNodes.SelectNodes(".//td"))
    {
        listBox1.Items.Add(cell.InnerText);
    }
}

**I'm stuck with this .SelectNodes thing, which no matter what gives me either null or too many nodes. I tried many combinations with both .SelectSingleNode and .SelectNodes.**

The second thing is that I have no clue how to get the number of items I will receive.

I looked for the answer on Stack Overflow and Google with some results, but none of them was close to my situation.


Solution

  • Try with this:

    using System.Collections.Generic;

    public class Person
    {
        public string Name { get; set; }
        public string Vocation { get; set; }
        public int Level { get; set; }
    
        public static List<Person> LoadPersons(HtmlAgilityPack.HtmlDocument doc)
        {
            var persons = new List<Person>();
    
            var rowsNodes = doc.DocumentNode.SelectNodes("//table//tr[contains(@class, 'hover')]");
            if (rowsNodes == null)
            {
                return persons;
            }
    
            foreach (var rowNode in rowsNodes)
            {
                var cells = rowNode.SelectNodes(".//td");
                if (cells != null && cells.Count >= 3)
                {
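                    // InnerText keeps the source whitespace and HTML entities, so decode and trim.
                    var name = HtmlAgilityPack.HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
                    var vocation = HtmlAgilityPack.HtmlEntity.DeEntitize(cells[1].InnerText).Trim();
                    var levelText = cells[2].InnerText.Trim();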
    
                    if (int.TryParse(levelText, out int level))
                    {
                        persons.Add(new Person
                        {
                            Name = name,
                            Vocation = vocation,
                            Level = level
                        });
                    }
                }
            }
    
            return persons;
        }
    }
    

    This class represents a person (a row in the table) and includes a method to scrape the table. When scraping, try to keep your queries fairly general: an XPath that spells out every tag on the way to the target stops matching as soon as the HTML changes slightly.

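    The query searches the whole document (//) for a table and, anywhere inside that table (// again, because browsers insert tbody into the DOM automatically while the raw HTML may not contain it), selects all rows (tr) with the "hover" class (your persons). HtmlAgilityPack parses the raw HTML, so a path copied from the browser that includes tbody can match nothing, which is likely why your SelectSingleNode call returned null.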

    The method then iterates over the rows, reading the text of the three cells in each. The last one (the level) is converted to an integer, and then the Person is created.
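
    To see the effect of tbody in a path, here is a minimal sketch that runs both kinds of query against a trimmed copy of the table from the question. The TbodyDemo class and its CountMatches helper are hypothetical names made up for this illustration; only HtmlDocument.LoadHtml and SelectNodes come from HtmlAgilityPack.

    ```csharp
    using System;
    using HtmlAgilityPack;

    public static class TbodyDemo
    {
        // Trimmed copy of the table from the question. The source contains no <tbody>.
        private const string SampleHtml = @"
    <table width='100%' class='tabi'>
      <tr><td colspan='7'>Characters</td></tr>
      <tr class='hover'>
        <td><a href='?page=character&name=Abe' class='menulink_hs'>Abe</a></td>
        <td>Elder Druid</td>
        <td>19</td>
      </tr>
    </table>";

        // Hypothetical helper: number of nodes an XPath matches in the sample (0 when none).
        public static int CountMatches(string xpath)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(SampleHtml);

            // SelectNodes returns null, not an empty collection, when nothing matches.
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
            return nodes?.Count ?? 0;
        }

        public static void Main()
        {
            // Browser-style path that assumes a tbody element.
            Console.WriteLine(CountMatches("//table/tbody/tr"));

            // General path used in LoadPersons.
            Console.WriteLine(CountMatches("//table//tr[contains(@class, 'hover')]"));
        }
    }
    ```

    The tbody path should find no rows in this sample, while the general path should find the single "hover" row. That null result is also the reason LoadPersons checks rowsNodes before iterating.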

    Next, you can create a class that defines each item in your list. I almost always create such a class so that I have something to work with when I get an item back from the ListBox (get the selected item as PersonItem and do any work with it...):

    public class PersonItem
    {
        public PersonItem(Person person)
        {
            this.Person = person;
        }
    
        public Person Person { get; }
    
        public override string ToString()
        {
            return $"{this.Person.Name} ({this.Person.Level})";
        }
    }
    

    It's simply a wrapper around Person that overrides ToString to return the text shown in the ListBox.

    Test it:

    var web = new HtmlWeb();
    var doc = web.Load("https://tibiantis.online/?page=whoisonline");
    
    var persons = Person.LoadPersons(doc);
    foreach (var person in persons)
    {
        var item = new PersonItem(person);
        listBox1.Items.Add(item);
    }
    
    // At any point, you can do things like this:
    var personItem = listBox1.SelectedItem as PersonItem;
    if (personItem != null)
    {
        var person = personItem.Person;
        // ...
    }
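
    On the second part of the question (how many items you will receive): the count is simply the size of the list that LoadPersons returns. A minimal sketch, using a stand-in Person class and a hypothetical CountAtLeast helper that are not part of the code above:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Minimal stand-in for the Person class above, so the snippet compiles on its own.
    public class Person
    {
        public string Name { get; set; }
        public string Vocation { get; set; }
        public int Level { get; set; }
    }

    public static class CountDemo
    {
        // Hypothetical helper: how many persons are at or above a given level.
        public static int CountAtLeast(IEnumerable<Person> persons, int minLevel)
        {
            return persons.Count(p => p.Level >= minLevel);
        }

        public static void Main()
        {
            // Stand-in for Person.LoadPersons(doc); the single row is the one shown in the question.
            var persons = new List<Person>
            {
                new Person { Name = "Abe", Vocation = "Elder Druid", Level = 19 }
            };

            Console.WriteLine(persons.Count);             // total rows parsed
            Console.WriteLine(CountAtLeast(persons, 10)); // rows passing a filter
        }
    }
    ```

    In the real program, persons.Count (or listBox1.Items.Count after filling the list) gives the same number without any extra query against the HTML.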