I have searched half the internet looking for help with my case.
Here is what I need.
I have an HTML structure to parse, like this:
<div class="foo">
<div class='bar' dir='ltr'>
<div id='p1' class='par'>
<p class='sb'>
<span id='dc_1_1' class='dx'>
<a href='/bar32560'>1</a>
</span>
Neque porro
<a href='/xyz' class='mr'>+</a>
quisquam est
<a href='/xyz' class='mr'>+</a>
qui.
</p>
</div>
<div id='p2' class='par'>
<p class='sb'>
<span id='dc_1_2' class='dx'>
<a href='/foo12356'>2</a>
</span>
dolorem ipsum
<a href='/xyz' class='mr'>+</a>
quia dolor sit amet,
<a href='/xyz' class='mr'>+</a>
consectetur, adipisci velit.
</p>
</div>
<div id='p3' class='par'>
<p class='sb'>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>3</a>
</span>
Neque porro quisquam
<a href='/xyz' class='mr'>+</a>
est qui dolorem ipsum quia dolor sit
<a href='/xyz' class='mr'>+</a>
amet, t.
<a href='/xyz' class='mr'>+</a>
<span id='dc_1_4' class='dx'>
<a href='/barefoot4135'>4</a>
</span>
consectetur,
<a href='/xyz' class='mr'>+</a>
adipisci veli.
<span id='dc_1_5' class='dx'>
<a href='/barfoo05123'>5</a>
</span>
Neque porro
<a href='/xyz' class='mr'>+</a>
quisquam est
<a href='/xyz' class='mr'>+</a>
qui.
</p>
</div>
</div>
</div>
What I need, in plain English: scrape each paragraph, but the final scraped text objects should have this form:
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.
The code I use for now:
page = Nokogiri::HTML(open(url))
x = page.css('.mr').remove
x.xpath("//div[contains(@class, 'par')]").map do |node|
  body = node.text
end
My result looks like this:
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.
So this scrapes the whole text of each div with class 'par'. What I need is the text that follows each span, together with the span's own content (the number). In other words, I need to cut those divs at each span.
I need something like:
SPAN.text + P.text - a.mr
I don't know how to do this.
Please help me with this parsing. I guess I need to scrape after/before each span.
I've tried everything I could find.
EDIT for @Duck1337:
I use the following code:
def verses
  page = Nokogiri::HTML(open(url))
  i = 0
  x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
    i += 1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end
I need this because I am parsing a big website full of text, and there are a few more methods. My final output looks like:
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.
But if a single verse contains multiple sentences, your code splits it at every sentence, which is too much splitting.
For example:
<div id='p1' class='par'>
<p class='sb'>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>1</a>
</span>
Neque porro quisquam. Est qui dolorem
<a href='/xyz' class='mr'>+</a>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>2</a>
</span>
est qui dolorem ipsum quia dolor sit.
<a href='/xyz' class='mr'>+</a>
amet, t.
Your code splits it like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.
I hope you see what I mean. Really big thanks to you for that. If you can modify this, it will be great!
EDIT for @KARDEIZ:
Thanks for the answer! When I use your code inside my method, it parses really random stuff:
def verses
  page = Nokogiri::HTML(open(url))
  i = 0
  #page.css(".mr").remove
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i += 1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end
The output looks like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: <the last part of the last sentence in the first paragraph, after a "+" link and before the last "+" link>
Saved record with: book: 1, chapter: 1, verse: 2, body: <the last part of the last sentence in the second paragraph, after a "+" link and before the last "+" link>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body: <the last sentence in the third paragraph; it comes after the last "+" link in this paragraph, with no more "+" links after it>
As you can see, I don't know how it makes such a mess ;] Can you do something more with that? Thanks a lot!
Regards!
Try something like:
x.xpath("//div[contains(@class, 'par')]//span").map do |node|
  out = node.content.strip
  if following = node.at_xpath('following-sibling::text()')
    out << ' ' << following.content.strip
  end
  out
end
The following-sibling::text() XPath selects the text nodes that follow the span, and at_xpath returns the first of them.
EDIT
I think this does what you want:
html.xpath("//div[contains(@class, 'par')]//span").map do |node|
  node.content.strip.tap do |out|
    while nn = node.next
      break if nn.name == 'span'
      out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
      node = nn
    end
  end
end
outputs:
[
"1 Neque porro quisquam est qui.",
"2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.",
"3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.",
"4 consectetur, adipisci veli.",
"5 Neque porro quisquam est qui."
]
It's also possible to do this with pure XPath (see the question "XPath axis, get all following nodes until"), but this solution is simpler from a coding perspective.
EDIT 2
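Try this. In your version, body = node runs after the loop has reassigned node, so body ends up being the last sibling visited instead of the assembled string. Assign the result of the tap block to body: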
def verses
  page = Nokogiri::HTML(open(url))
  i = 0
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    # body is the string built by tap: the span's number plus the text that follows it.
    body = node.content.strip.tap do |out|
      # Walk the following siblings until the next numbered span.
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i += 1
    VerseSource.new(body, book_num, number, i)
  end
end