
Ruby: incorrect strip behavior (possibly depending on charset)


I got weird behavior from Ruby (in irb):

irb(main):002:0> pp "    LS 600"
"\302\240\302\240\302\240\302\240LS 600"

irb(main):003:0> pp "    LS 600".strip
"\302\240\302\240\302\240\302\240LS 600"

In other words, the strip method has no effect on this string at all, and neither does gsub('/\s+/', '').

How can I strip this string? (I got it while parsing a web page.)


Solution

  • The string "\302\240" is the UTF-8 encoding (bytes C2 A0) of Unicode code point U+00A0, the non-breaking space character. There are many other Unicode space characters. Unfortunately, the String#strip method removes none of these.

    If you use Ruby 1.9.2, then you can solve this in the following way:

    # Ruby 1.9.2 or later.
    # Remove any whitespace-like characters from beginning/end.
    # \A and \z anchor to the whole string; ^ and $ would match at every line.
    "\302\240\302\240LS 600".gsub(/\A\p{Space}+|\p{Space}+\z/, "")
    

    In Ruby 1.8.7, support for Unicode is not as good. You might have success if you can depend on Rails's ActiveSupport::Multibyte, which has the advantage of giving you a working strip method for free. Install ActiveSupport with gem install activesupport and then try this:

    # Ruby 1.8.7/1.9.2.
    $KCODE = "u"
    require "rubygems"
    require "active_support/core_ext/string/multibyte"
    
    # Remove any whitespace-like characters from beginning/end.
    "\302\240\302\240LS 600".mb_chars.strip.to_s
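
If you are on a newer Ruby and don't want ActiveSupport, the regex approach above can be wrapped in a small helper. This is only a minimal sketch, assuming Ruby 1.9.2 or later and a string tagged as UTF-8. The name strip_unicode_space is made up for illustration; it is not a method from Ruby or ActiveSupport.

```ruby
# Hypothetical helper (not part of Ruby or ActiveSupport): removes Unicode
# whitespace, including U+00A0, from both ends of a string.
# Assumes Ruby 1.9.2 or later and a string tagged as UTF-8.
def strip_unicode_space(str)
  # \A and \z anchor to the start and end of the whole string;
  # \p{Space} matches Unicode whitespace, not just ASCII whitespace.
  str.gsub(/\A\p{Space}+|\p{Space}+\z/, "")
end

# "\u00A0" is the non-breaking space; \u escapes yield a UTF-8 string.
sample = "\u00A0\u00A0LS 600"

p sample.bytes.first(2)        # => [194, 160], i.e. C2 A0
p sample.strip == sample       # => true, strip leaves the string untouched
p strip_unicode_space(sample)  # => "LS 600"
```

The ordinary space inside "LS 600" is kept, because the pattern only matches at the very start and very end of the string.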