I'm trying to scrape some search results from
using the command
response.css('div.search_result_title').extract()
This works, but when I try to remove the HTML tags with
response.css('div.search_result_title::text').extract()
I keep getting mostly newline strings:
[u'\n', u'\n(Dissolved)\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n']
Do you guys know why? Thanks!
Do you want to get the headers' text? You have an a element inside the div, so yes, ::text on the div only returns the whitespace around that link, which is a lot of empty data. Use div.search_result_title a::text instead.
As for your second question, about getting a whole block's text:
for i in response.css('div.searchResult'):
    # collect every text node in the block, drop the whitespace-only ones, join the rest
    print(' '.join([j.strip() for j in i.css('::text').extract() if j.strip()]))
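The strip-and-join step can also be tried without Scrapy. This is just a sketch: join_text is a hypothetical helper name, and the input list is a shortened version of the output shown in the question plus one made-up company name.

```python
def join_text(fragments):
    """Strip each text fragment, drop the empty ones, join with single spaces."""
    return ' '.join(f.strip() for f in fragments if f.strip())

# Fragments shaped like what .extract() returns for one result block
# ('Example Ltd' is invented for the example).
fragments = ['\n', 'Example Ltd', '\n(Dissolved)\n', '\n', '\n']
print(join_text(fragments))
```

Filtering with f.strip() before joining is what keeps the whitespace-only nodes from turning into runs of spaces in the result.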