ruby, ruby-on-rails-4, web-crawler, nokogiri

Data scraping with Nokogiri and Pismo


I'm working on a small app to save bookmarks. I use Nokogiri and Pismo (separately) to crawl a webpage and get its title tag.

Nokogiri doesn't correctly save Japanese, Chinese, Russian, or any other language with unusual characters. Pismo, on the other hand, does save the characters from these languages, but it's a little slow and it doesn't extract title information as well as Nokogiri.

Could anybody recommend a better gem or a better way to save that data?

doc = Nokogiri::HTML(open(bookmark_params[:link]))

@bookmark = current_user.bookmarks.build(bookmark_params)
@bookmark.title = doc.title.to_s

This is what I mean by "weird characters".

If I use Nokogiri on the link below to get the page title

youtube.com/watch?v=QXAwnMxlE2Q
this is what I get:

NTV interview foreigners in Japan æ¥ãã¬å¤äººè¡é ­ã¤ã³ã¿ãã¥ã¼ Eng...

But using the Pismo gem, this is what I get:

NTV interview foreigners in Japan 日テレ外人街頭インタビュー English Subtitles 英語字幕

This is the result I actually want, but the gem is a bit slower.


Solution

  • See Phrogz's answer to "Nokogiri, open-uri, and Unicode Characters", which I think correctly describes what is happening for you. In summary, for some reason there is an issue with passing the IO object created by open-uri to Nokogiri. Instead, read the document in as a string and give that to Nokogiri, i.e.:

    require 'nokogiri'
    require 'open-uri'
    
    open("https://www.youtube.com/watch?v=QXAwnMxlE2Q") {|f|
      p f.content_type     # "text/html"
      p f.charset          # "UTF-8"
      p f.content_encoding # []
    }
    
    doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q"))
    puts doc.title.to_s # =>  NTV interview foreigners in Japan æ¥ãã¬å¤äººè¡é ­ã¤ã³ã¿ãã¥ã¼ English Subtitles è±èªå­å¹ - YouTube
    
    
    doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q").read)
    puts doc.title.to_s # => NTV interview foreigners in Japan 日テレ外人街頭インタビュー English Subtitles 英語字幕 - YouTube
    

    If you know the content is always going to be UTF-8, you could of course do this:

    doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q"), nil, "UTF-8")
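The garbled title in the question looks like UTF-8 bytes decoded as ISO-8859-1 (Latin-1). That is an inference from the output, not something confirmed against Nokogiri's internals. The sketch below reproduces the effect with plain Ruby strings, so it needs neither Nokogiri nor network access. The helper names `misread_as_latin1` and `undo_misread` are made up for illustration.

```ruby
# Assumption: the parser treated the page's UTF-8 bytes as ISO-8859-1.
# Stdlib only; both helper names are hypothetical.

# Reinterpret the raw UTF-8 bytes as Latin-1, then transcode to UTF-8
# so the result can be printed. Every byte becomes its own character.
def misread_as_latin1(utf8_string)
  utf8_string.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Reverse the mistake: map each character back to a single byte,
# then label those bytes as the UTF-8 they always were.
def undo_misread(garbled)
  garbled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

title   = "日テレ"
garbled = misread_as_latin1(title)

puts garbled                        # mojibake beginning with "æ"
puts undo_misread(garbled) == title # the original bytes were never lost
```

Each Japanese character is three bytes in UTF-8, so the three-character title turns into nine Latin-1 characters, some of them invisible control characters. Telling the parser the right encoding up front, as in the last snippet above, avoids the misreading altogether.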