ruby-on-rails ruby web-scraping capybara poltergeist

How to extract text from multiple paragraphs in a div with Poltergeist/Capybara


I am using this command to select the p tags:

session.all('.entry p')

It gives this result:

[#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[1]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[2]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[3]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[4]">,
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[5]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[6]">,
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[7]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[8]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[9]">, 
#<Capybara::Element tag="p" path="//HTML[1]/BODY[1]/DIV[3]/DIV[1]/SECTION[1]/DIV[1]/ARTICLE[1]/DIV[3]/P[10]">]

Now I want to extract all the text from these p nodes. I know I can loop over them and merge the paragraph text myself, but is there another way that Capybara provides?


Solution

  • The result of #all is a Capybara::Result. The doc says:

    A Result represents a collection of Node::Element on the page. It is possible to interact with this collection similar to an Array because it implements Enumerable [...]

    So you can treat it like any other Enumerable, but it does not offer a dedicated method for what you are asking.

    You can do this to retrieve the concatenated content:

    session.all('.entry p').map(&:text).join
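
    To see what that call chain does without a browser, here is a minimal sketch that uses stand-in objects. FakeElement and FakeResult are hypothetical names invented for this illustration, not Capybara classes; the only assumption is that each element responds to text and that the collection is Enumerable, as the docs quoted above describe. It also shows that a bare join runs the paragraphs together, so you may prefer to pass a separator.

    ```ruby
    # Stand-ins so the snippet runs without Capybara or a browser.
    # FakeElement and FakeResult are hypothetical; they are not Capybara classes.
    FakeElement = Struct.new(:text)

    class FakeResult
      include Enumerable

      def initialize(elements)
        @elements = elements
      end

      # Enumerable only needs #each; map, select, reject, etc. come for free.
      def each(&block)
        @elements.each(&block)
      end
    end

    paragraphs = FakeResult.new([
      FakeElement.new("First paragraph."),
      FakeElement.new(""),
      FakeElement.new("Second paragraph.")
    ])

    # Same call chain as in the answer: no separator, so paragraphs run together.
    joined_plain = paragraphs.map(&:text).join

    # Variation: drop empty paragraphs and keep a line break between the rest.
    joined_lines = paragraphs.map(&:text).reject(&:empty?).join("\n")

    puts joined_plain
    puts joined_lines
    ```

    With a real session, the same chain would start from session.all('.entry p') instead of the stand-in collection.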
    

    Judging by your "web-scraping" tag, I assume you're using Capybara for web scraping and not for testing. Because Capybara's main purpose is testing, it has no built-in method for what you're asking.

    If you are writing a test, though, you should instead do something like this (I used RSpec here):

    within('.entry') do
      expect(page).to have_text 'something'
    end
    

    Or, if you really need to be very specific about where the text appears (which in most cases is unnecessary), you should test each element on its own:

    expect(session.all('.entry p')[0]).to have_content 'something'
    expect(session.all('.entry p')[1]).to have_content 'something else'
    

    One last side note: for web scraping, there are better options than Capybara.