
Inner Text of element returns nothing, although html shows there is content


I've been working on a scraping project for some time, and everything had been going perfectly until now. I'm trying to scrape the inner text of these table cells, but it comes back empty.

Initial Code before Debugging:

var characterPageHandle = await page.QuerySelectorAllWithLogAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td", Logger); // Extension to IPage to log what's going on with queries, but still returns the same
var numPages = await characterPageHandle[^2].GetInnerTextAsync<int>(); // Only want the 2nd to last one to know how many pages to iterate through

// Additionally, 'GetInnerTextAsync<T>()' is another extension to IElementHandle that does 'GetPropertyAsync("innerText")' and 'JsonValueAsync<T>' together (cleaner look)

Exception Thrown from this:

System.FormatException: The input string '' was not in a correct format.

As you can see, I can select the table cells just fine, but I cannot get their inner text. Here's what the HTML looks like and what I've done to debug.

HTML Code:

<td>Prev</td>
<td class="checked" style="display: table-cell;">1</td>
<td style="display: table-cell;">2</td>
<td style="display: table-cell;">3</td>
<td style="display: table-cell;">4</td>
<td style="display: table-cell;">5</td>
<td style="display: table-cell;">...</td>
<td style="display: table-cell;">8</td>
<td>Next</td>

First Approach:

var characterPageHandle = await page.QuerySelectorAllWithLogAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td", Logger);

foreach (var characterPage in characterPageHandle)
    Logger.LogInformation("Inner: {Inner}", await characterPage.GetInnerTextAsync());

Second Approach:

var characterPageHandle = await page.QuerySelectorAllWithLogAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td", Logger);

foreach (var characterPage in characterPageHandle)
    Logger.LogInformation("Inner: {Inner}", (await characterPage.GetPropertyAsync("innerText")).RemoteObject.Value.ToString());

Third Approach:

var characterPageHandle = await page.QuerySelectorAllAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td");

foreach (var characterPage in characterPageHandle)
    Logger.LogInformation("Inner: {Inner}", await characterPage.GetPropertyValueAsync("textContent"));

All three log the same:

Inner: Prev
Inner: 
Inner: 
Inner: 
Inner: 
Inner: 
Inner: 
Inner: 
Inner: Next

The last thing I tried before coming here was logging every property of the table cells, but to no avail.

Code:

foreach (var characterPage in characterPageHandle)
{
    foreach (var property in await characterPage.GetPropertiesAsync())
        Logger.LogInformation("Property: {@Property}", await property.Value.JsonValueAsync());
            
    Logger.LogInformation(" ");
}

Output:

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

Property: [[[[[[[[]], [[]], [[]], [[[[]]]], [[]], [[]]]]]]], [[]]]

After that, I decided to ask for help here, since I've never run into this before with PuppeteerSharp or in any other work with HTML and JS. Any help would be appreciated. Thanks!


Solution

  • I figured it out myself.

    Although the HTML on the website showed the inner text, I had to call IPage.SelectAsync() on the dropdown that controls this part of the page, followed by IPage.WaitForSelectorAsync(). The dropdown sets how many characters are displayed per page, and the HTML above is the pager showing how many pages are needed to display all of them. There are 72 characters and the default is 10 per page, so the 2nd to last td was 8: seven pages of 10 characters and an eighth with the remaining 2. The highest setting is 100, which leaves only one page.
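
    The page arithmetic above is a ceiling division. A minimal sketch for illustration only; Paging and PageCount are hypothetical names, not something the site or PuppeteerSharp provides:

    ```csharp
    using System;

    public static class Paging
    {
        // Ceiling division: pages needed to show totalItems at pageSize items per page.
        public static int PageCount(int totalItems, int pageSize)
        {
            if (totalItems < 0) throw new ArgumentOutOfRangeException(nameof(totalItems));
            if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));

            // Adding pageSize - 1 rounds any partial last page up to a full page.
            return (totalItems + pageSize - 1) / pageSize;
        }
    }
    ```

    Paging.PageCount(72, 10) gives 8 and Paging.PageCount(72, 100) gives 1, matching the pager described above.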

    Once I selected a value in the dropdown, I was finally able to retrieve the inner text.

    Added Code:

    await page.SelectAsync("#characters > table > tbody > tr > td:nth-child(1) > select", "10");
    await page.WaitForSelectorAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td");
    
    // Will go for IPage.WaitForExpressionAsync() later
    

    And logging works!

    Inner: Prev
    Inner: 1
    Inner: 2
    Inner: 3
    Inner: 4
    Inner: 5
    Inner: ...
    Inner: 8
    Inner: Next
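
    The IPage.WaitForExpressionAsync() idea mentioned in the code comment could look roughly like the sketch below. It waits until at least one pager cell contains a number instead of only waiting for the cells to exist. PagerWait, WaitForPagerTextAsync and the two constants are hypothetical names, and treating "some cell is numeric" as the ready condition is an assumption about this site, not something I have verified:

    ```csharp
    using System.Threading.Tasks;
    using PuppeteerSharp;

    public static class PagerWait
    {
        // Selectors copied from the code above.
        private const string PageSizeSelect =
            "#characters > table > tbody > tr > td:nth-child(1) > select";
        private const string PagerCells =
            "#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td";

        public static async Task WaitForPagerTextAsync(IPage page)
        {
            await page.SelectAsync(PageSizeSelect, "10");

            // JavaScript evaluated in the page: true once any pager cell holds only digits.
            // Assumption: the cells stay empty until the pager has been filled in.
            var expression =
                $"Array.from(document.querySelectorAll('{PagerCells}'))" +
                ".some(td => /^[0-9]+$/.test(td.innerText.trim()))";

            await page.WaitForExpressionAsync(expression);
        }
    }
    ```

    The point of waiting on the text rather than on the selector is that the td elements were already present when their inner text was still empty.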
    

    However, when I changed 10 to 100, the same number of cells was returned, but every cell after the first page had an inner text of 0, as shown here:

    Inner: Prev
    Inner: 1
    Inner: 0
    Inner: 0
    Inner: 0
    Inner: 0
    Inner: 0
    Inner: 0
    Inner: Next
    

    Final Code:

    await page.SelectAsync("#characters > table > tbody > tr > td:nth-child(1) > select", "100");
    await page.WaitForSelectorAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td");
    
    // https://stackoverflow.com/a/68202534/22546586 (ty for System.Linq.Async)
    var characterPages = await (await (await page.QuerySelectorAllWithLogAsync("#characters > table > tbody > tr > td:nth-child(2) > table > tbody > tr > td", Logger))
        .ToAsyncEnumerable()
        .WhereAwait(async h => await h.GetInnerTextAsync() is not "0")
        .ToArrayAsync())[^2] // All we need is the 2nd to last to know how many pages to iterate
        .GetInnerTextAsync<int>();
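
    As an alternative to filtering on the literal "0" and indexing with [^2], the page count could also be read by parsing each cell's text and keeping the last positive number. A minimal sketch, assuming the cell texts have already been collected into strings; PagerText and LastPageNumber are hypothetical names, not part of PuppeteerSharp:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    public static class PagerText
    {
        // Returns the last positive integer found among the pager cell texts.
        public static int LastPageNumber(IEnumerable<string> cellTexts)
        {
            var numbers = new List<int>();

            foreach (var text in cellTexts)
            {
                // NumberStyles.None accepts digits only, so "Prev", "...", "Next",
                // empty strings and nulls are skipped instead of throwing FormatException.
                if (int.TryParse(text?.Trim(), NumberStyles.None, CultureInfo.InvariantCulture, out var n) && n > 0)
                    numbers.Add(n);
            }

            if (numbers.Count == 0)
                throw new InvalidOperationException("No page numbers found in the pager cells.");

            return numbers[numbers.Count - 1];
        }
    }
    ```

    With the 10-per-page texts logged above this returns 8, and with the 100-per-page texts it returns 1.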