ruby-on-rails, ruby, parsing, web-scraping, string-parsing

Ruby pluck second number from this scraped HTML (wombat)


Here's a section of HTML I'm trying to pull some info from:

<div class="pagination">
  <p>
    <span>Showing</span>
    1-30
    of 3744
    <span>results</span>
  </p>
</div>

I just want to store 3744 from the bit I pull (everything inside the <p>), but I'm having a hard time because the "of 3744" text isn't wrapped in an element I could target with a CSS selector, and I don't understand XPath at all :)

<span>Showing</span>1-30\nof 3744<span>results</span>

How would you parse the above string to only retrieve the total number of results?


Solution

  • As long as the markup always looks the same, you could also use String#scan to grab just the last number:

    str = '<div class="pagination">
               <p>
                 <span>Showing</span>
                    1-30
                    of 3744
                 <span>results</span>
               </p>
           </div>'
    str.scan(/\d+/).pop.to_i
    #=> 3744
    

    Update: explanation of how it works

    scan returns an Array of every run of digits as strings, e.g. ["1", "30", "3744"]. pop then takes the last element, "3744", and to_i converts it to the Integer 3744.
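
    To make the intermediate values visible, here is a minimal sketch of the same chain split into steps. It assumes the string from the question as input; the variable names are only illustrative, and it uses last instead of pop so the Array is left unchanged for printing.

    ```ruby
    # Assumed input: the text pulled from inside the <p>, as in the question.
    html = "<span>Showing</span>1-30\nof 3744<span>results</span>"

    numbers = html.scan(/\d+/)  # every run of digits, as strings
    last    = numbers.last      # same element pop would return, without mutating
    total   = last.to_i         # String -> Integer

    p numbers  # ["1", "30", "3744"]
    p total    # 3744
    ```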

    Please note that this only works if the number you want is the last one in the Array. For example:

    str = '<div class="pagination">
               <p>
                 <span>Showing</span>
                    1-30
                    of 3744
                 <span>results 14</span>
               </p>
           </div>'
    str.scan(/\d+/).pop.to_i
    #=> 14
    

    As you can see, because I added the number 14 to the results span, it is now the last number in the Array and the result is wrong. You could modify the code to something like this:

    str.gsub(/\s+/, '').scan(/\d+-\d+of(\d+)/).flatten.pop.to_i
    #=> 3744
    

    This first strips all whitespace (newlines included) with gsub, so the relevant text becomes 1-30of3744. It then scans for the pattern "digits, a hyphen, digits, the word of, digits" and captures only the final group of digits, giving [["3744"]]. flatten turns that into ["3744"], and pop followed by to_i yields the Integer 3744. This seems like a better solution because it matches only the "of ####" part, regardless of what other numbers appear in the markup.
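
    One thing to watch with the chained version: if the pattern is not found, scan returns an empty Array, pop returns nil, and nil.to_i is 0, so a missing total looks the same as a total of zero. A possible variant, sketched below, skips the gsub by allowing whitespace in the pattern and returns nil when nothing matches. The method name total_results is hypothetical, and the sketch assumes the word "of" only appears in front of the total.

    ```ruby
    # Hypothetical helper: returns the total as an Integer, or nil when the
    # "of <number>" text is not present.
    def total_results(html)
      digits = html[/\bof\s+(\d+)/, 1]  # capture group 1, or nil if no match
      digits&.to_i
    end

    sample = <<~HTML
      <div class="pagination">
        <p>
          <span>Showing</span>
          1-30
          of 3744
          <span>results 14</span>
        </p>
      </div>
    HTML

    p total_results(sample)         # 3744
    p total_results("<p>none</p>")  # nil
    ```

    Returning nil lets the caller decide what to do when the page layout changes, rather than silently storing 0.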